Asynchronous system-on-a-chip interconnect

ABSTRACT

Methods and apparatus are described relating to a system-on-a-chip which includes a plurality of synchronous modules, each synchronous module having an associated clock domain characterized by a data rate, the data rates comprising a plurality of different data rates. The system-on-a-chip also includes a plurality of clock domain converters. Each clock domain converter is coupled to a corresponding one of the synchronous modules, and is operable to convert data between the clock domain of the corresponding synchronous module and an asynchronous domain characterized by transmission of data according to an asynchronous handshake protocol. An asynchronous crossbar is coupled to the plurality of clock domain converters, and is operable in the asynchronous domain to implement a first-in-first-out (FIFO) channel between any two of the clock domain converters, thereby facilitating communication between any two of the synchronous modules.

RELATED APPLICATION DATA

The present application is a continuation of and claims priority under 35 U.S.C. 120 to U.S. patent application Ser. No. 10/634,597 for ASYNCHRONOUS SYSTEM-ON-A-CHIP INTERCONNECT filed on Aug. 4, 2003 (Attorney Docket No. FULCP009), which claims priority under 35 U.S.C. 119(e) to U.S. Provisional Patent Application No. 60/444,820 for ASYNCHRONOUS INTERCONNECT SYSTEM filed on Feb. 3, 2003 (Attorney Docket No. FULCP009P), the entire disclosures of both of which are incorporated herein by reference for all purposes. U.S. patent application Ser. No. 10/634,597 is also a continuation-in-part of and claims priority under 35 U.S.C. 120 to each of U.S. patent application Ser. No. 10/136,025 for ASYNCHRONOUS CROSSBAR CIRCUIT WITH DETERMINISTIC OR ARBITRATED CONTROL filed on Apr. 30, 2002 (Attorney Docket No. FULCP001), and U.S. patent application Ser. No. 10/212,574 for TECHNIQUES FOR FACILITATING CONVERSION BETWEEN ASYNCHRONOUS AND SYNCHRONOUS DOMAINS filed on Aug. 1, 2002 (Attorney Docket No. FULCP002), to which the present application also claims priority on the same basis. The entire disclosures of both of these applications are also incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

The present invention relates to asynchronous digital circuit design and, in particular, to asynchronous circuitry for interconnecting a variety of system resources in the context of a system-on-a-chip.

A so-called "system-on-a-chip" (SOC) is typically designed with a number of modules, each of which has its own clock. For example, such a system might include a memory controller, an I/O interface (e.g., PCI or HyperTransport), internal peripherals (e.g., SRAM or computing logic), computing resources (e.g., one or more CPUs), and some kind of interconnect for allowing the modules to interact with each other. In a typical SOC, the memory controller might operate at 300 MHz, the I/O interface at 400 MHz, the internal peripherals at 600 MHz, and each of the CPUs at 1.1 GHz. This makes it very difficult to implement an efficient interconnect solution.

Conventional approaches to this problem involve the use of a high-speed synchronous bus in which transmissions to and from the various system modules on the bus are synchronized. That is, such a bus typically employs a clock signal whose value is constrained by specific ratios relative to the clock signals being synchronized. Not only is this synchronization difficult to achieve, but as soon as the performance of any of the modules (which are typically associated with different vendors) changes, i.e., a clock speed increases, the ratios of the synchronization solution no longer apply, and a completely new solution must be implemented.

In view of the foregoing, it is desirable to provide an interconnect solution for implementing SOCs which allows various system modules having independent clock domains to communicate effectively and efficiently. It is also desirable that any such interconnect solution be flexible with regard to changes in individual system module performance.

SUMMARY OF THE INVENTION

According to the present invention, a system interconnect solution is provided which is operable to interconnect system modules having different clock domains in a manner which is insensitive to variations in system module performance.

According to various embodiments, the present invention provides methods and apparatus relating to an integrated circuit which includes a plurality of synchronous modules, each synchronous module having an associated clock domain characterized by a data rate, the data rates comprising a plurality of different data rates. The integrated circuit also includes a plurality of clock domain converters. Each clock domain converter is coupled to a corresponding one of the synchronous modules, and is operable to convert data between the clock domain of the corresponding synchronous module and an asynchronous domain characterized by transmission of data according to an asynchronous handshake protocol. An asynchronous crossbar is coupled to the plurality of clock domain converters, and is operable in the asynchronous domain to implement a first-in-first-out (FIFO) channel between any two of the clock domain converters, thereby facilitating communication between any two of the synchronous modules. According to a more specific embodiment of the invention, the integrated circuit includes at least one repeater. Each repeater is coupled between a selected one of the clock domain converters and the asynchronous crossbar.

According to various implementations, the integrated circuit may comprise a wide variety of system types including, for example, multi-processor data processing systems and synchronous optical network (SONET) interconnect switches.

A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a Mueller consensus element.

FIG. 2 is a representation of a Q-way split.

FIG. 3 is a representation of a P-way merge.

FIG. 4 is a simplified representation of an asynchronous crossbar.

FIG. 5 is a schematic representation of a first portion of a split bus.

FIG. 6 is a schematic representation of a second portion of a split bus.

FIG. 7 is a schematic representation of a first portion of a merge bus.

FIG. 8 is a schematic representation of a second portion of a merge bus.

FIG. 9 is a schematic representation of a first implementation of a router cell.

FIG. 10 is a schematic representation of a second implementation of a router cell.

FIG. 11 is a schematic representation of a third implementation of a router cell.

FIG. 12 is a schematic representation of a fourth implementation of a router cell.

FIG. 13 is a representation of a dispatcher for use with any of a variety of crossbar circuits.

FIG. 14 is a representation of an output controller portion of a dispatcher.

FIG. 15 is another representation of a dispatcher for use with any of a variety of crossbar circuits.

FIG. 16 is a representation of an arbiter for use with any of a variety of crossbar circuits.

FIG. 17 is a schematic representation of an output controller portion of an arbiter.

FIG. 18 is another representation of an arbiter for use with any of a variety of crossbar circuits.

FIG. 19 is a representation of a datapath crossbar.

FIGS. 20A-20C show crossbar circuits for use in implementing a crossbar using various timing assumptions according to a specific embodiment of the invention.

FIG. 21 is a simplified block diagram of an asynchronous-to-synchronous (A2S) interface designed according to a specific embodiment of the invention.

FIG. 22 is a simplified block diagram of a synchronous-to-asynchronous (S2A) interface designed according to a specific embodiment of the invention.

FIG. 23 is a simplified block diagram of a burst mode A2S interface designed according to a specific embodiment of the invention.

FIG. 24 is a simplified block diagram of a transfer token generation circuit according to a specific embodiment of the invention.

FIG. 25 is a simplified block diagram of a transfer token distribution circuit according to a specific embodiment of the invention.

FIG. 26 is a simplified block diagram of a burst mode S2A interface designed according to a specific embodiment of the invention.

FIGS. 27-45 illustrate various components of specific implementations of an A2S interface and an S2A interface according to various specific embodiments of the invention.

FIGS. 46-55 illustrate various components of specific implementations of an A2S interface and an S2A interface according to various other specific embodiments of the invention.

FIGS. 56-64 illustrate various implementations of A2S and S2A burst-mode interfaces according to specific embodiments of the invention.

FIGS. 65-69 illustrate various implementations of A2S and S2A burst-mode interfaces according to other specific embodiments of the invention.

FIG. 70 is a block diagram illustrating a system interconnect solution implemented according to a specific embodiment of the invention.

FIG. 70A is a simplified circuit diagram of a repeater which may be used with various embodiments of the invention.

FIG. 71 is a block diagram of a crossbar and its control circuitry for use with a specific embodiment of the invention.

FIG. 72 is a schematic diagram of an arbiter circuit for use with various embodiments of the present invention.

FIG. 73 is a block diagram showing the inclusion of a rate throttling circuit in a particular implementation.

FIG. 74 is a block diagram of a clock domain converter according to a specific embodiment.

FIG. 75 is a block diagram illustrating the operation of a built-in self-test protocol according to a specific embodiment of the invention.

FIGS. 76A-76C illustrate exemplary test vector and transaction formats according to a particular embodiment.

FIGS. 77 and 78 illustrate exemplary system-on-a-chip designs implemented according to various embodiments of the present invention.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

Reference will now be made in detail to specific embodiments of the invention, including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In addition, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.

Asynchronous VLSI is an active area of research and development in digital circuit design. It refers to all forms of digital circuit design in which there is no global clock synchronization signal. Delay-insensitive asynchronous designs, by their very nature, are insensitive to the signal propagation delays which have become the single greatest obstacle to the advancement of traditional design paradigms. That is, delay-insensitive circuit design maintains the property that any transition in the digital circuit could have an unbounded delay and the circuit will still behave correctly. The circuits enforce sequencing but not absolute timing. This design style avoids the design and verification difficulties that arise from timing assumptions, glitches, and race conditions.

For background information regarding delay-insensitive asynchronous digital design, please refer to the following papers: A. J. Martin, "Compiling Communicating Processes into Delay-Insensitive Circuits," Distributed Computing, Vol. 1, No. 4, pp. 226-234, 1986; U. V. Cummings, A. M. Lines, and A. J. Martin, "An Asynchronous Pipelined Lattice Structure Filter," Advanced Research in Asynchronous Circuits and Systems, IEEE Computer Society Press, 1994; A. J. Martin, A. M. Lines, et al., "The Design of an Asynchronous MIPS R3000 Microprocessor," Proceedings of the 17th Conference on Advanced Research in VLSI, IEEE Computer Society Press, 1997; and A. M. Lines, "Pipelined Asynchronous Circuits," Caltech Computer Science Technical Report CS-TR-95-21, Caltech, 1995; the entire disclosure of each of which is incorporated herein by reference for all purposes.

See also U.S. Pat. No. 5,752,070 for "Asynchronous Processors" issued May 12, 1998, and U.S. Pat. No. 6,038,656 for "Pipelined Completion for Asynchronous Communication" issued on Mar. 14, 2000, the entire disclosure of each of which is incorporated herein by reference for all purposes.

At the outset, it should be noted that many of the techniques and circuits described in the present application are described and implemented as delay-insensitive asynchronous VLSI. However, it will be understood that many of the principles and techniques of the invention may be used in other contexts such as, for example, non-delay-insensitive asynchronous VLSI and synchronous VLSI.

It should also be understood that the various embodiments of the invention may be implemented in a wide variety of ways without departing from the scope of the invention. That is, the asynchronous processes and circuits described herein may be represented (without limitation) in software (object code or machine code), in varying stages of compilation, as one or more netlists, in a simulation language, in a hardware description language, by a set of semiconductor processing masks, and as partially or completely realized semiconductor devices. The various alternatives for each of the foregoing as understood by those of skill in the art are also within the scope of the invention. For example, the various types of computer-readable media, software languages (e.g., Verilog, VHDL), simulatable representations (e.g., SPICE netlist), semiconductor processes (e.g., CMOS, GaAs, SiGe, etc.), and device types (e.g., FPGAs) suitable for designing and manufacturing the processes and circuits described herein are within the scope of the invention.

The present application also employs the pseudo-code language CSP (communicating sequential processes) to describe high-level algorithms. CSP is typically used in parallel programming software projects and in delay-insensitive VLSI. It will be understood that the use of this particular language and notation is merely exemplary and that the fundamental aspects of the present invention may be represented and implemented in a wide variety of ways without departing from the scope of the invention.

In addition, transformation of CSP specifications to transistor-level implementations for various aspects of the circuits described herein may be achieved according to the techniques described in "Pipelined Asynchronous Circuits" by A. Lines (incorporated by reference above). However, it should be understood that any of a wide variety of asynchronous design techniques may also be used for this purpose.

The CSP used herein has the following structure and syntax. A process is static and sequential and communicates with other processes through channels. Together, a plurality of processes constitute a parallel program. The [ and ] demark if statements, and *[ and ] demark loops.

Multiple choices can be made by adding pairs of B→S inside an if statement or a loop, separated by a [] (indicates deterministic selection) or a ▮ (indicates non-deterministic selection), where B is a Boolean expression and S is a statement. Thus [B1→S1 [] B2→S2] means if expression B1 is true, execute S1, or if expression B2 is true, execute S2. If neither B1 nor B2 is true, this statement will wait until one is (unlike an if-else construct). The shorthand *[S] means repeat statement S infinitely. The shorthand [B] means wait for Boolean expression B to be true. Local variables are assumed to be integers, and can be assigned to integer expressions as in x := y+1. The semicolon separates statements with strict sequencing. The comma separates statements with no required sequencing. The question mark and exclamation point are used to denote receiving from and sending to a channel, respectively. Thus *[A?x; y := x+1; B!y] means receive integer x from channel A, then assign integer y to the expression x+1, then send y to channel B, then repeat forever.
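
By way of illustration only (the following Python sketch is not part of the original specification; the channel model and names are invented here), the process *[A?x; y := x+1; B!y] can be mirrored in conventional software using blocking queues as channels:

    import queue
    import threading

    A = queue.Queue(maxsize=1)    # channel A
    B = queue.Queue(maxsize=1)    # channel B

    def process():
        # *[A?x; y := x+1; B!y]: receive x, compute y, send y, forever.
        while True:
            x = A.get()           # A?x (blocks until a token arrives)
            y = x + 1             # y := x+1
            B.put(y)              # B!y (blocks until the receiver has room)

    threading.Thread(target=process, daemon=True).start()
    A.put(41)
    print(B.get())                # prints 42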

According to various specific embodiments of the invention, the latching of data happens in channels instead of registers. Such channels implement a FIFO (first-in-first-out) transfer of data from a sending circuit to a receiving circuit. Data wires run from the sender to the receiver, and an enable (i.e., an inverted sense of an acknowledge) wire goes backward for flow control. According to specific ones of these embodiments, a four-phase handshake between neighboring circuits (processes) implements a channel. The four phases are, in order: 1) Sender waits for high enable, then sets data valid; 2) Receiver waits for valid data, then lowers enable; 3) Sender waits for low enable, then sets data neutral; and 4) Receiver waits for neutral data, then raises enable. It should be noted that the use of this handshake protocol is for illustrative purposes and that therefore the scope of the invention should not be so limited.
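
The four phases can be made concrete with a small illustrative model. The following Python sketch is an exposition aid only; the class and signal names are invented here, and data validity is abstracted to a single flag rather than dual-rail wires:

    # Illustrative model of the four-phase handshake; 'data_valid' stands
    # in for the dual-rail data wires, and 'enable' is the inverted
    # acknowledge running from receiver back to sender.
    class Channel:
        def __init__(self):
            self.data_valid = False
            self.enable = True            # receiver starts out ready

    def four_phase_cycle(ch):
        assert ch.enable                  # 1) sender sees high enable ...
        ch.data_valid = True              #    ... and sets data valid
        assert ch.data_valid              # 2) receiver sees valid data ...
        ch.enable = False                 #    ... and lowers enable
        assert not ch.enable              # 3) sender sees low enable ...
        ch.data_valid = False             #    ... and sets data neutral
        assert not ch.data_valid          # 4) receiver sees neutral data ...
        ch.enable = True                  #    ... and raises enable

    ch = Channel()
    four_phase_cycle(ch)                  # one token transferred per cycle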

According to specific embodiments, the delay-insensitive encoding of data is dual-rail, also called 1of2. In this encoding, 2 wires (rails) are used to represent 2 valid states and a neutral state. When both wires are low, the data is neutral. When the first wire is high, the data is a valid 0. When the second wire is high, the data is a valid 1. The two wires are not allowed to be high at once. The wires associated with channel X are written X⁰ and X¹ for the data, and X^e for the enable.

According to other embodiments, larger integers are encoded by more wires, as in a 1of3 or 1of4 code. For much larger numbers, multiple 1ofN's are used together with different numerical significance. For example, 32 bits can be represented by 32 1of2 codes or 16 1of4 codes. In this case, a subscript indicates the significance of each 1ofN code, i.e., L_g^r is the rth wire of the gth bit (or group), and L_g^e is the associated enable.
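
As an illustrative sketch (not taken from the specification; the helper names are invented here), the following Python functions encode an integer as a sequence of 1of4 groups, least significant group first, with exactly one rail high per group and the all-low state reserved for neutral:

    def encode_1of4(value, groups):
        # Encode 'value' as 'groups' 1of4 codes, least significant first.
        # Each group has exactly one rail high; all rails low is neutral.
        codes = []
        for _ in range(groups):
            rails = [0, 0, 0, 0]
            rails[value & 3] = 1          # one-hot rail for this 2-bit digit
            codes.append(rails)
            value >>= 2
        return codes

    def decode_1of4(codes):
        value = 0
        for g, rails in enumerate(codes):
            value |= rails.index(1) << (2 * g)
        return value

    codes = encode_1of4(0xCAFE, 8)        # 16 bits as 8 1of4 codes
    assert decode_1of4(codes) == 0xCAFE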

According to still other embodiments, several related channels may be organized into a 1-D or 2-D array, such as L[i] or V[i,j]. To identify individual wires in such embodiments, the notation L[i]^r or L[i]_g^r is used.

According to a specific embodiment, the design of a crossbar according to the invention employs a method described in U.S. Pat. No. 6,038,656 (incorporated herein by reference above) to improve the speed of large datapaths. This method describes a way of breaking up the datapath into multiple datapaths of smaller bit sizes, for example, reducing one thirty-two-bit datapath into four eight-bit datapaths, while preserving insensitivity to delays.

Some of the figures in this disclosure include box-and-arrow diagrams and transistor diagrams. In the box diagrams, the boxes represent circuits or processes and the arrows represent FIFO channels between the boxes. FIFO channels may also exist within the boxes. Any channel or wire with the same name is intended to be connected, even when no connection is drawn. Sometimes the "internal" port names of a circuit are drawn inside the box next to an incoming or outgoing channel.

In the transistor diagrams, arrows (or lines) represent individual wires. Standard gate symbols are used wherever possible, with the addition of a C-element, drawn like a NAND gate with a "C" on it. This gate is a standard asynchronous gate, also called a Mueller C-element or a consensus element. A gate representation and a transistor-level implementation of a C-element 100 are shown in FIG. 1.

It should be noted that, for the purpose of clarity, certain features are omitted from the circuit diagrams. For example, some circuit nodes are "dynamic," which means that they are not always driven high or low, and are expected to hold their state indefinitely. This requires a "staticizer," i.e., a pair of small cross-coupled inverters attached to the node. Staticizers are omitted, but can be inferred to exist on any node where the pull-up and pull-down networks are not logical complements (essentially all non-standard gates and C-elements). In addition, most of these pipelined circuits must be reset to an initial state when the chip boots, which requires a few extra transistors using the Reset signal and its complement. Usually the reset state is achieved by forcing the left enables low while Reset is asserted.

As described herein, a Split is a 1-to-Q bus which reads a control channel S, reads one token of input data from a single L channel, then sends the data to one of Q output channels selected by the value read from S. A Merge is a P-to-1 bus which reads a control channel M, then reads a token of data from one of P input channels as selected by the value read from M, then sends that data to a single output channel R. FIG. 2 shows a basic block diagram of a Split 200. FIG. 3 shows a basic block diagram of a Merge 300. See also "Pipelined Asynchronous Circuits" by A. Lines, incorporated by reference above.

According to various embodiments of the invention, a P-to-Q crossbar 400 may be constructed from P Q-way splits and Q P-way merges as shown in FIG. 4. The ith of the P split busses, i.e., split[i], runs the program *[S[i]?j, L[i]?x; V[i,j]!x]. The jth of the Q merge busses, i.e., merge[j], runs the program *[M[j]?i; V[i,j]?x; R[j]!x]. According to a first asynchronous crossbar design which may be employed with various embodiments of the invention, the V[i,j] represent intermediate data channels between the split data outputs and the merge data inputs. According to specific embodiments of the invention described below, these channels have been eliminated.
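
These two CSP programs can be mirrored in ordinary software. The following Python sketch is purely illustrative (it models channels as FIFO queues and steps the busses manually rather than running them as concurrent processes); it routes one token from input port 0 to output port 1 of a 2-to-2 crossbar:

    from queue import Queue

    P, Q = 2, 2
    L = [Queue() for _ in range(P)]   # input data channels
    S = [Queue() for _ in range(P)]   # per-input split controls
    M = [Queue() for _ in range(Q)]   # per-output merge controls
    R = [Queue() for _ in range(Q)]   # output data channels
    V = [[Queue() for _ in range(Q)] for _ in range(P)]  # intermediate V[i,j]

    def split_step(i):
        # One iteration of *[S[i]?j, L[i]?x; V[i,j]!x]
        j = S[i].get()
        x = L[i].get()
        V[i][j].put(x)

    def merge_step(j):
        # One iteration of *[M[j]?i; V[i,j]?x; R[j]!x]
        i = M[j].get()
        x = V[i][j].get()
        R[j].put(x)

    # Route one token from input port 0 to output port 1.
    S[0].put(1); L[0].put("token"); M[1].put(0)
    split_step(0)
    merge_step(1)
    print(R[1].get())   # -> token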

Crossbar 400 is controlled from both the input and output sides via the S[i] and M[j] control channels. Based on the information in these control channels, the sequence of tokens sent through each channel is completely deterministic with respect to the input and output channels, but not with respect to any larger group of channels. That is, the timing of communications on unrelated channels is unconstrained. Any two unrelated pairs of input/output ports can communicate in parallel without any contention. If two input/output transfers refer to the same input or output port, the control stream associated with that port will unambiguously determine the ordering. Various techniques for generating the information in these control channels are described below.

As mentioned earlier in this document, one type of asynchronous crossbar designed according to the present invention includes actual channels V[i,j] for passing information from a split bus to the designated merge bus. These channels may be used to advantage in a variety of ways. For example, varying amounts of buffering may be added to the intermediate channels associated with each link to achieve various performance objectives. However, because of these channels and the associated handshaking overhead, the size and/or power consumption of an asynchronous crossbar designed in such manner could be prohibitive depending upon the magnitude of either P or Q.

Thus, a specific embodiment of the invention provides a crossbar design which eliminates at least some of these channels by combining at least a portion of the split and merge functionalities into a single router cell (the notation router_cell is also used herein). The externally visible behavior of an asynchronous crossbar designed according to this embodiment is virtually identical to that of the same size (i.e., P-to-Q) crossbar including the V[i,j] channels, except that the enhanced crossbar design has one stage less slack (i.e., pipeline depth).

A specific embodiment of a crossbar designed according to the present invention will now be described with reference to FIGS. 5-8. According to this implementation, each split bus includes one split_env part and Q split_cell parts, and each merge bus includes one merge_env part and P merge_cell parts. The split_cell contains the part of the split bus replicated for each output channel, and the split_env contains the rest of the circuitry. Likewise, the merge_cell contains the part of the merge bus replicated for each input channel. As will be discussed with reference to FIG. 9, and according to a specific embodiment, the functionalities of each pair of split_cell and merge_cell corresponding to a particular input/output combination are combined into a single router_cell, thus eliminating the intervening channels between the split and merge busses.

Functionally, each split_cell[i,j] waits for S[i] to be valid and checks that the value of S[i] equals j (that is, S[i]^j is true). If so, it checks the enable from its output V[i,j]^e and, when that is high, it copies the valid data from L[i] to V[i,j]. Once the data are copied to V[i,j], the split_cell[i,j] lowers its enable to the split_env, se[i,j]. Eventually, the S[i], L[i], and V[i,j]^e return to neutral, so that the split_cell[i,j] can reset the data and raise se[i,j] again. A schematic for a split_cell 500 with 1-bit data and 1-bit control (both encoded as 1of2 codes) is shown in FIG. 5.

The split_env[i] tests the validity and neutrality of the L[i] channel, computes the logical AND of the se[i, 0 . . . Q−1]'s from the split_cell's, and produces an acknowledge for the S[i] and L[i] input channels. The validity and neutrality of the S[i] channel is implied by the acknowledges from the split_cell's. A schematic for a split_env 600 for 1-bit data and 2 split_cell's is shown in FIG. 6.

Each merge_cell[i,j] waits for M[j] to be valid and checks that the value of M[j] equals i (that is, M[j]^i is true). If so, it waits for a go[j] signal from the merge_env (which includes the readiness of the output enable R[j]^e) and for the input data V[i,j] to be valid. When this happens, it copies the value of V[i,j] to R[j]. The merge_env checks the validity of R[j] and broadcasts this condition back to all the merge_cell's by setting rv[j] high. Next, the merge_cell lowers its enables me[i,j] and V[i,j]^e. Once the M[j] and V[i,j] data return to neutral, and go[j] is lowered, the R[j] is returned to neutral, rv[j] is lowered, and the merge_cell raises the enables me[i,j] and V[i,j]^e. A schematic for a merge_cell 700 with 1-bit data and 1-bit control (encoded as 1of2 codes) is shown in FIG. 7.

The merge_env checks the readiness of the R[j] acknowledge and raises go[j]. The M[j] goes directly to the merge_cell's, one of which responds by setting R[j] to a new valid value. The merge_env then raises rv[j], after which the merge_cell replies with me[i,j]. The merge_env[j] checks the completion of these actions, and then acknowledges M[j]. Once M[j] has become neutral again and R[j] has acknowledged, the merge_env[j] lowers go[j], which causes the merge_cell's to reset me[i,j]. The merge_env[j] also resets R[j] to the neutral value. Once these actions have been completed, the merge_env[j] lowers the acknowledge of M[j]. A schematic for a merge_env 800 for 1-bit data and 2 merge cells is shown in FIG. 8.

According to another specific embodiment of the invention, at each grid point in a crossbar (i.e., for each combination of i and j) there is a router_cell[i,j] which combines the functionalities of one split_cell[i,j] and one merge_cell[i,j] as described above. The split_env[i] and merge_env[j] communicate with their router_cell's using the handshaking protocol described above. The router_cell waits for the superset of all conditions of the separate split_cell and merge_cell and performs the actions of both with respect to their env's.

It should be noted that embodiments of the invention are envisioned in which only selected links are implemented with the router_cell of the present invention. According to such embodiments, other links are implemented using the split_cell and merge_cell of FIGS. 5 and 7 and their associated intermediate channels V[i,j]. Such embodiments might be useful where, for example, additional buffering is desired on one or more specific links, but it is undesirable to pay the area penalty associated with having intermediate channels for every link.

According to an even more specific embodiment, the router_cell does the following. It waits for its S[i] input to be valid and equal to j, for its M[j] input to be valid and equal to i, for L[i] to be valid, and for go[j] from the merge_env to be high. Once all this happens, the router_cell[i,j] copies L[i] directly to R[j] without an intermediate V[i,j] channel. The merge_env[j] detects that the R[j] has been set, and signals that by raising rv[j]. Then the router_cell[i,j] can lower its enables to the env's, se[i,j] and me[i,j], which can be the same signal.
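
Expressed as a software predicate (an illustrative sketch only; the argument names are invented here), the set-phase condition of the router_cell is simply the conjunction of all four requirements:

    def router_cell_ready(S_i, M_j, L_i_valid, go_j, i, j):
        # Set-phase guard of router_cell[i,j]: all four conditions must
        # hold before L[i] is copied directly to R[j].
        return S_i == j and M_j == i and L_i_valid and go_j

    assert router_cell_ready(S_i=1, M_j=0, L_i_valid=True, go_j=True, i=0, j=1)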

The reset phase proceeds symmetrically. The router_cell waits for S[i] and M[j] to be neutral and go[j] to go down. The merge_env[j] will reset the R[j] to neutral, and then signal the completion by lowering rv[j]. Finally, the router_cell[i,j] raises its enables to both env's. The schematic for a router_cell 900 with 1-bit data and 1-bit S[i] and M[j] is shown in FIG. 9. According to a specific embodiment, the split_env and merge_env employed with router_cell 900 may be the same as those used with separate split_cell's and merge_cell's (see FIGS. 6 and 8 above).

As will be understood, and according to various embodiments, either of the basic crossbar implementations can be extended to different data sizes and P and Q values. There are also several circuit variations, described subsequently, which may improve area or speed. That is, the various data encodings, router cell circuit implementations, and other circuit implementation variations described subsequently represent various tradeoffs between area and speed.

According to various embodiments, the S[i] may be encoded with a 1ofQ channel to select among Q possible split_cell's. This increases the fanout on the S wires, and requires a larger AND tree to combine the se[i,j]'s in the split_env. Likewise, the M[j] may be encoded with a 1ofP channel to select among P possible merge_cell's. The number of control wires scales linearly with P and Q, which is suitable for smaller crossbars, e.g., 8 by 8 or smaller. According to even more specific embodiments, the AND trees for se and me are physically distributed across the datapath to reduce wiring.

For larger crossbars, e.g., 16 by 16 or larger, the S[i] and M[j] can each be encoded with a pair of 1ofN codes, 1ofA by 1ofB, which yields A*B possibilities. The least and most significant halves of the S control are called S[i]₀ and S[i]₁. Likewise for M[j]₀ and M[j]₁. The wiring cost of this encoding scales with √P or √Q, and works well up to 64 by 64 crossbars. In a delay-insensitive design, it is possible to check only one of the S[i]₀/S[i]₁ pair for neutrality in the router_cell, provided the split_env checks the other one. Likewise for the M[j]₀/M[j]₁ pair.

With a large P or Q, the logic used to detect when a certain router_cell is selected (also referred to as a "hit") becomes increasingly complicated, and this cost is duplicated for all data wires. Therefore, according to one embodiment for a crossbar having a large P, Q, or data size, a hit[i,j] signal is computed in a single hit circuit rather than using the S and M wires directly. An example router_cell 1000 with 1-bit data and 2×1of4 control using a symmetric hit circuit is shown in FIG. 10. An alternate router_cell 1100 using an asymmetric hit circuit which does not check neutrality of S[i]₁ or M[j]₁ is shown in FIG. 11. The asymmetric hit circuit requires that the split_env and merge_env be modified to check the neutrality of S[i]₁ and M[j]₁, respectively.

According to various embodiments, it is straightforward to modify the data encoding to other 1ofN codes, e.g., to a 1of1 to signal an event, to 1of4 for a good low-power encoding of 2 bits, and so on. According to embodiments with larger data sizes, multiple 1ofN codes may be employed. FIG. 12 shows a router_cell 1200 with 4-bit data and control encoded with 2×1of4 channels, using the asymmetric hit circuit of FIG. 11. It is possible to use separate rv[j]₀/rv[j]₁ and go[j]₀/go[j]₁ wires corresponding to each 1of4, as shown, or to combine them into single rv[j] and go[j] from the merge_env.

According to various specific embodiments, multicast may be supported in a crossbar designed according to the present invention. According to one such embodiment, the S[0 . . . P−1] control is changed from a 1ofQ code to a bit vector S[0 . . . P−1, 0 . . . Q−1] of 1of2 codes. Each S[i,j] bit goes directly to the router_cell[i,j]'s, where the S[i,j]¹ wire is used in the hit circuit instead of S[i]^j. In the split_env, the se[i,j] signals are first AND'ed with the inverse of S[i,j]⁰ and then combined with a C-element tree instead of an AND tree. Essentially, multiple simultaneous hits can occur for one input, and the split_env must check that they all complete. The merge side is controlled as before. It should be noted that implementations of the dispatch and arbiter circuits described subsequently herein may be configured to control such a multicast crossbar.

Various embodiments of an asynchronous crossbar designed according to the present invention are organized into several parallel chunks of less than the datapath size. Assume the datapath size is B bits (which requires 2*B wires for the delay-insensitive code in this embodiment), the number of additional control wires used in a split is s, and the number of additional control wires used in a merge is m (for an embodiment which uses 1-hot control encoding). If the datapath is broken up into chunks of C bits, then the wiring-limited area of the crossbar will be (B/C)*P*Q*(2*C+s)*(2*C+m). Thus, the optimum C is √(s*m)/2.

Using this formula, a 32-bit datapath with 12 wires of split control overhead and 14 wires of merge control overhead should be broken into a chunk size of about 6 to 7 bits. In practice, other factors come into consideration, such as the desired speed of the circuit (which favors smaller chunks) and the convenience of various chunk sizes. For example, depending upon such considerations, a 32-bit crossbar could be implemented as 8 chunks of 4 bits (faster) or 4 chunks of 8 bits (smaller). Other chunk sizes might have unacceptable area, speed, or inconvenience penalties but are still within the scope of the present invention.
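
The arithmetic behind the 6-to-7-bit figure is easy to check. The short Python sketch below is illustrative only; the P and Q values are chosen arbitrarily, since they scale the area expression by a constant factor and do not affect the optimum:

    import math

    B, s, m = 32, 12, 14        # datapath bits, split and merge control wires
    P = Q = 16                  # port counts (illustrative; constant factor)

    def wiring_area(C):
        # Wiring-limited area: (B/C) * P * Q * (2C + s) * (2C + m)
        return (B / C) * P * Q * (2 * C + s) * (2 * C + m)

    print(math.sqrt(s * m) / 2)               # optimum C, about 6.48
    for C in (4, 6, 7, 8, 16):
        print(C, wiring_area(C))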

Various techniques for generating the S[i] and M[j] control channels for an asynchronous crossbar will now be described. It will be understood that such techniques may be applied to any of a variety of asynchronous crossbar architectures including, for example, the different crossbars described above. That is, the dispatch and arbiter circuits described herein may be employed not only to control any of the crossbar circuits designed according to the invention, but any type of crossbar circuit having the basic functionality of interconnecting P input channels with Q output channels. According to various embodiments, control of multicast crossbars and two-way transactions may also be provided by specific implementations of these circuits.

According to various embodiments of the invention, the partial (or projected) order of the data transfers in a P-to-Q crossbar, i.e., the order of operations when projected on a given channel, should be deterministic. That is, the operations which involve a certain channel happen in a deterministic order, but operations on different channels can happen in any order relative to each other. Thus, according to one such embodiment, a dispatcher is provided which solves the following problem: given an ordered sequence of input instructions on channels L[0 . . . P−1], route each instruction to one of R[0 . . . Q−1] output channels specified by a TO[0 . . . P−1] channel for that instruction.

The dispatcher must maintain the order of instructions to each output channel. However, it is not required that instructions to different output channels be delivered in order. This allows internal pipelining in the implementation, arbitrary buffering on all channels, and multiple simultaneous transfers.

Where P is 1, a straightforward implementation of the dispatcher is just a Q-way split bus, using L as the data input, TO as S, and R[0 . . . Q−1] as the outputs. According to an even more specific embodiment, additional buffering may be provided on the output channels to allow later instructions to be issued despite an earlier stalled instruction to a different R.

According to another embodiment, multiple instructions are issued in parallel with proper ordering using a crossbar. The L[i] and R[j] data channels of the dispatcher connect directly to the crossbar. The TO[i] of the dispatcher is copied to the crossbar's S[i]. The M[j] crossbar control channels are derived from the TO[i]'s such that they maintain the program order projected on each output channel. According to one embodiment, this is accomplished in the following manner.

Referring to dispatcher 1300 of FIG. 13, each input_ctrl[i] sends a request bit req[i,j] (e.g., a 1of2 code) to each output_ctrl[j] indicating whether or not this input wishes to go to that output based on TO[i]. Then each output_ctrl[j] collects these bits from all input_ctrl's and determines the indices of each 1 in cyclic order. These indices control the M[j] channel of the crossbar. The crossbar then transfers the payload.
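
Abstractly, the token stream on each M[j] is just the program-ordered list of input indices destined for output j. The following Python sketch illustrates that ordering requirement only, not the circuit itself (the function name and data layout are invented here):

    def derive_merge_controls(instructions, Q):
        # instructions: program-ordered (input_port, to_output) pairs.
        # Returns, for each output j, the input indices in program order,
        # i.e., the token sequence output_ctrl[j] must feed to M[j].
        M = [[] for _ in range(Q)]
        for input_port, to_output in instructions:
            M[to_output].append(input_port)
        return M

    # Inputs 0 and 1 both target output 0; their relative order is kept.
    print(derive_merge_controls([(0, 0), (1, 0), (1, 1)], Q=2))
    # -> [[0, 1], [1]]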

The input controller, e.g., the input_ctrl[i] circuit, which produces the req[i,j] bits and copies TO[i] to S[i], may be derived using the approach described in "Pipelined Asynchronous Circuits" by A. Lines, incorporated by reference above.

Each output controller (also referred to herein as a combine) accepts a bit vector and reads off the positions of all 1's in cyclic order from input 0 to P−1. According to one embodiment, this is achieved using a binary tree structure. Each stage in the tree receives the number of 1's on its lower significance L input, then from its higher significance H input, and outputs the sum to the next stage of the tree. These numbers are encoded serially with a 1of3 code with the states: zero, last, and not-last. For example, 3 is represented by the sequence: not-last, not-last, last.

Each tree stage also outputs a 1of2 channel to indicate whether the 1 came from the low (0) or high (1) side. This extra channel becomes the MSB of the index so far. The LSBs so far are obtained by a 2-way merge of the index from either the low or high previous stage, controlled by the current MSB. The final 1of3 sum of the tree is discarded, and the accumulated index bits become the M control for the crossbar.
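
Functionally, the combine emits the indices of the 1 bits in order from input 0 to P−1, assembling each index as it moves up the tree. A recursive Python sketch (illustrative only; it models the tree's end-to-end behavior, not the serial 1of3 count encoding or the MSB/LSB merge wiring) is:

    def combine_tree(bits, offset=0):
        # Read off the positions of the 1's in 'bits' from index 0 upward.
        # Each stage concatenates the indices found in its low half followed
        # by its high half, mirroring the low-before-high order of the tree.
        if len(bits) == 1:
            return [offset] if bits[0] else []
        half = len(bits) // 2
        low = combine_tree(bits[:half], offset)
        high = combine_tree(bits[half:], offset + half)
        return low + high

    print(combine_tree([1, 0, 1, 1]))   # -> [0, 2, 3]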

According to various specific embodiments of the invention, the combine may be implemented using the approach described in "Pipelined Asynchronous Circuits" by A. Lines, incorporated by reference above. In such embodiments, one internal state bit is provided to distinguish sequences coming from the left or right sides. FIG. 14 shows a 4-way tree combine 1400. The CSP for a specific embodiment of such a combine circuit is as follows:

    "zero" := 0, "notlast" := 1, "last" := 2;
    *[ L?l;
       [ l = "zero" -> H?h;
         [ h = "zero"    -> R!"zero", done := true
         []h = "notlast" -> R!"notlast", M!1, done := false
         []h = "last"    -> R!"last", M!1, done := true
         ];
         *[ ~done -> H?h;
            [ h = "notlast" -> R!"notlast", M!1
            []h = "last"    -> R!"last", M!1, done := true
            ]
          ]
       []l = "notlast" -> R!"notlast", M!0
       []l = "last" -> M!0, H?h;
         [ h = "zero"    -> R!"last", done := true
         []h = "notlast" -> R!"notlast", R!"notlast", M!1, done := false
         []h = "last"    -> R!"notlast", R!"last", M!1, done := true
         ];
         *[ ~done -> H?h;
            [ h = "notlast" -> R!"notlast", M!1
            []h = "last"    -> R!"last", M!1, done := true
            ]
          ]
       ]
     ]

L and H are input request counts encoded serially with 1of3 codes. R is the output request count encoded serially. M is the most significant bit of the next index so far and controls the merge of the accumulated least significant bits from previous stages.

Although the combine can be implemented as a tree using existing techniques, a smaller implementation is also provided which may be advantageous for large fan-ins. It uses a rippling ring circuit which inspects each input request in cyclic order, driving a corresponding 1ofN data rail if its input is 1, or skipping ahead if the input is 0. The rails of this 1ofN code must be kept mutually exclusive. This version of the combine has irregular throughput and latency characteristics, and may only be valuable for its area savings for large fan-ins.

According to various specific embodiments, a crossbar is used to execute a series of "move" instructions, each of which specifies an input port and an output port of the crossbar and transfers several tokens across that link. In one such embodiment, the move instruction identifies the input port, the output port, and a repeat count. According to an even more specific embodiment, an ordered sequence of these move instructions is issued in parallel via two dispatch circuits. It will be understood that the repeat count is merely one mechanism which this embodiment may employ.

According to this embodiment, the first dispatch circuit dispatches the output port and repeat count to the specified input port. The second dispatches the input port and repeat count to the output port. That is, the move instruction is copied two ways, with either the input or output port serving as the S control for the corresponding dispatch. The repeat count is unrolled locally at the input and output ports. That is, the same crossbar control is reissued until the count is used up. A specific implementation of a dispatcher 1500 having two such dispatch circuits is shown in FIG. 15.

The use of the dispatchers ensures that the moves will be executed in the original program order if they have either port in common, but may execute them out of order or in parallel if they refer to different ports. The dispatchers are also capable of scaling up to a very large number of move instructions at once. This can be used as an optimization to avoid wasting power or bandwidth in the dispatcher, and also can greatly compress the original instruction stream.
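
A software sketch of the unrolling (illustrative only; the function name and data layout are invented here) shows how one move stream yields the per-port S and M control streams, with program order preserved wherever two moves share a port:

    def unroll_moves(moves):
        # moves: ordered (input_port, output_port, repeat_count) tuples.
        # Returns the S stream per input and the M stream per output; the
        # same control value is reissued until each count is used up.
        S_streams, M_streams = {}, {}
        for inp, out, count in moves:
            S_streams.setdefault(inp, []).extend([out] * count)
            M_streams.setdefault(out, []).extend([inp] * count)
        return S_streams, M_streams

    # Two moves sharing output port 1 keep their program order on M[1].
    print(unroll_moves([(0, 1, 3), (1, 1, 2)]))
    # -> ({0: [1, 1, 1], 1: [1, 1]}, {1: [0, 0, 0, 1, 1]})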

Another embodiment of the invention facilitates use of a crossbar as a message-passing communications interconnect. According to this embodiment, it is assumed that each input port provides the desired destination port number on a TO channel, which becomes the S control of the crossbar. Each input port requests permission to use the desired output port. Each output port generates the M control by arbitrating among the requests from all inputs contending for access to the same output. An optional FROM channel can be sent with the output, which may comprise, for example, a copy of the M control of the crossbar. Such an option may be useful, for example, with certain communication protocols in which it is desirable to know the identity of the sender.

The control per input copies the TO to S and uses it as the control for a split bus which sends a 1of1 request channel req[i,j] to the intended output control. The control per output collects the requests from the input controls and arbitrates among them. The result of the arbitration is used as the M of the crossbar, and may also be copied to a FROM channel if desired.

According to one embodiment, a P-way arbiter which arbitrates among the requests is built as a binary tree, much like the combine of the last section. Each stage in the binary tree receives a request from either the left (e.g., lower indices) or right (e.g., higher indices) side of the previous stage. It outputs a 1of2 channel for the winner's MSB to a side tree of merges which accumulate the index of the winner, just as for the combine. It sends a 1of1 to request the next stage of the tree. FIG. 16 shows a tree structure 1600 for an 8-way arbiter.

According to a specific embodiment of an arbiter, the circuit for each stage of the arbiter involves metastability. The CSP is:

*[ L̅[0] → L[0]?, T!, A!0 ▮ L̅[1] → L[1]?, T!, A!1 ]

where L[0 . . . 1] are the trigger inputs, T is the trigger output, and A is the arbitration result. FIG. 17 shows one implementation of a circuit 1700 with this behavior. According to this embodiment, the output request is made by OR'ing the input requests and is not metastable. Only the side 1of2 A output employs actual arbitration and a metastability filter. This arbiter tree is weakly fair, and works as first-come-first-served if contending requests are spaced out enough in time. If the contending requests come faster, all requests will be serviced, but not necessarily at strictly fair rates.

According to a further embodiment, arbitrated control of a crossbar is facilitated by an arbiter which avoids deadlock conditions. As mentioned above, the crossbar controlled by such an arbiter may be any type of crossbar including, but not limited to, those described herein.

Suppose an input port A is trying to go to output C then D, and another input port B is trying to go to outputs D then C. Due to slack in the request and arbitration channels, it is possible under a delay-insensitive timing model that A would win D and B would win C. But A is trying to send to C first, and B is trying to send to D first. Thus, the system deadlocks.

Thus, according to a specific embodiment, "slack" is eliminated so that an input can't make another request until the previous one has won its arbitration. This is done by introducing a "grant" token (e.g., a 1of1 channel) which is returned by the output port to the input port when that input wins the arbitration. This mechanism prevents inputs from making more than one outstanding request.

According to one implementation, the grant is returned via a small crossbar with its S control copied from the output's M and its M control copied from the input's S. The output R 1of1 data channel is fed into the input's split bus. The input side starts with a single grant token. FIG. 18 shows an arbiter 1800 for effecting arbitrated control for a crossbar using this grant scheme.
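
In software terms, the grant token behaves like a one-permit semaphore per input port: a second request cannot be issued until the first has won its arbitration and returned the grant. A minimal Python sketch under that analogy (class names invented here; the arbiter tree is reduced to a lock):

    import threading

    class InputPort:
        def __init__(self):
            # One grant token per input: holding it is the right to have
            # a single outstanding request in flight.
            self.grant = threading.Semaphore(1)

        def request(self, output_port, payload):
            self.grant.acquire()          # blocks while a request is pending
            output_port.arbitrate(self, payload)

    class OutputPort:
        def __init__(self):
            self.arbiter = threading.Lock()   # stands in for the arbiter tree

        def arbitrate(self, input_port, payload):
            with self.arbiter:
                # ... payload crosses the datapath crossbar here ...
                input_port.grant.release()    # return the grant token

    a, c = InputPort(), OutputPort()
    a.request(c, "data")     # a second a.request(...) waits for the grant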

The grant crossbar of the present invention is also operable to establish a useful ordering relationship. Suppose an input port A sends some data to output B, then sends a notification to output C that the data is ready. If C then reads the data from B, it will get the value that A wrote, because A's communication to B won the arbitration first. This satisfies the producer-consumer ordering model required by many bus protocols.

According to other embodiments of the invention, alternatives to using such a grant crossbar are provided. In general, to avoid deadlock, it is necessary to avoid winning the arbitrations in a different order from that in which they were requested. One way to do this is to implement the request/arbiter circuits with a total slack of 1 or less, such that a second request will always be blocked until the first one has been granted. This avoids the need for a grant crossbar, and can be smaller. However, this zero-slack design reduces the throughput (since the circuits cannot precharge in parallel with another request starting) and requires different zero-slack implementations of the components instead of the usual pipelined building blocks. The grant crossbar is effectively a way of forcing the pipeline to have a slack of 1 even if it is built out of more pipelined elements.

Transactions in a typical system interconnect often have atomic sizes larger than one word. That is, for one request and arbitration, many cycles of data may need to be transferred. This can be achieved according to one embodiment of the present invention by associating a "tail" bit with the data through the main crossbar. According to this embodiment, the tail bit is sampled both by the input and output ports, and is fed into a simple control unit which repeats the same control values until the tail bit is 1. According to other embodiments, a simple counter may be employed using information associated with the data itself (e.g., in a packet) or which comes with the control data controlling the crossbar. As will be understood, these are merely examples of mechanisms which embodiments of the invention may employ to effect the transfer of data of arbitrary size. The scope of the invention should not be so limited.
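
The tail-bit mechanism can be sketched as a control repeater: the same control value is reissued for every data word until a word arrives with its tail bit set. An illustrative Python fragment (names invented here):

    def drive_link(control_value, words):
        # words: (data, tail_bit) pairs making up one packet. Reissue the
        # same crossbar control for every word; release the link when a
        # word carries tail = 1.
        for data, tail in words:
            yield control_value, data
            if tail == 1:
                return

    packet = [("w0", 0), ("w1", 0), ("w2", 1)]
    for ctrl, data in drive_link(3, packet):
        print(ctrl, data)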

A request/arbitrate circuit designed according to specific embodiments of the present invention is concerned only with "packets" and sets up the datapath link according to the received control values. The datapath crossbar can transfer a large block of data, then release the link after the last cycle by setting the tail bit to 1. FIG. 19 shows a datapath crossbar 1900 with the extra repeaters on the control inputs. According to an alternate embodiment, a repeat count could be used instead of the tail bit. However, the tail bit may be easier to implement in the hardware, and doesn't prohibit also specifying lengths in the data packets.

According to further embodiments of the invention, two different crossbar datapaths are controlled using a single arbitrated control circuit to implement two-way transactions. According to one such embodiment, input and output 1of2 channels LTYPE and RTYPE are added to an arbiter circuit designed according to the invention for each port. If the LTYPE channel is 1, the normal S/M control is also copied to become the M/S control of a second crossbar for a returning transaction. If the LTYPE channel is 0, the second crossbar isn't used. The information in the LTYPE channel is copied to the RTYPE channel of the output, so that the target unit knows whether or not to respond. This implementation can support a mixture of 1-way transactions (e.g., store) and 2-way transactions (e.g., load, swap, read-modify-write). According to more specific embodiments, if the modules which are connected by the two crossbars are exclusively masters (initiators) or targets (responders), the two crossbars can be asymmetrically sized (e.g., an 8×4 request crossbar and a 4×8 response crossbar). According to one such embodiment, this scheme is used to efficiently implement a shared memory bridge.

Some additional exemplary applications of the three types of asynchronous circuits described above will now be discussed. However, it will be understood that the crossbars, dispatchers, and arbiters of the present invention may be used in a wide variety of applications and that therefore the scope of the present invention is not limited to the applications described.

In one such exemplary application, a superscalar CPU with P-way instruction issue and Q pipelines could use a P×Q dispatcher to send instructions to the correct pipelines while preserving ordering to each pipeline. The TO control would be decoded from the instructions.

In other exemplary embodiments relating to RISC-style superscalar asynchronous CPUs, crossbars can be used to route the Z result of any execution pipeline to any register, or to route the reads from any register to the X and Y operands of any pipeline. Each register could delay a write until the next access of that register, such that any data-dependent read could be quickly bypassed. The latency from Z result to a dependent X or Y operand could be as little as 6 transitions, 2 each for the result crossbar, the register itself, and the operand crossbar. This low-latency bypass feature eliminates the need for additional bypass circuitry. The control of these crossbars can be generated from parallel RISC instructions using variations on the "move" control scheme. This implementation is large, but allows significant reordering (i.e., it only retains the partial ordering projected on results, operands, and registers) and can scale to very wide issue designs. Even with a dual-issue CPU, this register file could often do more than two instructions at once for short bursts, which could help catch up after stalls.

According to various embodiments, an arbitrated crossbar designed according to the invention can be used to connect several modules on a chip. Each module would be given an input and output port on the crossbar. In some embodiments, each module would be able to send one-way tail-terminated packets to each of the other modules. Some modules could be memories, which could receive stores, and reply to load requests with load completion packets. Others could be I/O interfaces, especially those based on flow-controlled bidirectional FIFOs. Others could be CPUs, DSPs, or ASICs which can access the I/Os and memories, or send packets to each other. These packets could be used to implement cache coherence protocols or hardware-supported message passing. In addition, legacy bus protocols such as PCI could be tunneled over such a crossbar since it supports the required ordering relationships.

According to further embodiments, an arbitrated crossbar designed according to the invention could act as a switch fabric for packet switching. Each incoming packet would have an in-band destination field which would be extracted for use as the TO control. The length of the packet would be converted into a tail bit sequence. The FROM output could be inserted back into the packet if desired. According to more specific embodiments, in the presence of contention, it may be desirable to add FIFOs on all inputs and outputs and to make sure the whole system has significant overspeed to recover from transient congestion.

It should be noted that although specific embodiments have been described herein with reference to a delay-insensitive handshake protocol, various embodiments of the invention are provided in which different types of timing assumptions are made in otherwise delay-insensitive circuits. For example, timing assumptions may be used to make an otherwise delay-insensitive circuit faster and lower power at the cost of additional circuit verification engineering. The best timing assumption for a particular circuit depends on the critical path of the circuit and the amount of additional verification work a designer is willing to take on. Of particular interest are timing assumptions that are local to one four-phase handshake (described above), or to one internal path within one cell between external handshakes. When this class of timing assumptions is applied to complex cells with critical paths longer than the rest of the delay-insensitive circuitry, it is especially desirable. These timing assumptions apply to asynchronous circuits that use four-phase return-to-neutral handshakes, and generally 1-hot data encoding.

In general, there are three types of timing assumptions which may apply to various embodiments of the invention. When the pulse timing assumption is applied to an otherwise delay-insensitive four-phase handshake, all of the set conditions are completed: data validity, control validity, acknowledge validity, etc. However, the reset phase of the handshake is not completed and is assumed to happen with an adequate timing margin. In this scheme, the data, control, and acknowledge signals from output channels are not checked in the reset phase of the handshake, except that occasionally an acknowledge signal is used opportunistically as a good precharge signal for the data. In some cases one may also forego checking the completion of the output data. This scheme requires that once the link is set up, nothing may block the data from being computed and the channels from going through the reset phase.

When the implied-data-neutrality timing assumption is applied to an otherwise delay-insensitive four-phase handshake, the computed data on the output channels is completed in the set direction, but not in the reset phase. All acknowledges are still checked in all directions. This scheme requires that once the acknowledge of an output channel is set, no events may block the reset phase of the data channel.

Interfering operators are common in circuit design in general, but are forbidden by the delay-insensitive timing model. Interference causes glitching. In delay-insensitive circuit design, cut-off transistors prevent interference. However, with adequate timing margin, a circuit designer can guarantee glitch-free operation in an otherwise delay-insensitive circuit.

A specific example of the use of such timing assumptions in circuits designed according to the invention is illustrative. A 16-to-16-ported 4-bit crossbar efficiently implemented according to a specific delay-insensitive approach of the present invention requires 20 transitions per cycle. However, a crossbar design with similar functionality may be implemented with the timing assumptions described above which requires only 12 transitions per cycle. This theoretically makes the circuit 67% faster.

FIGS. 20A-20C show how the circuit diagrams for a router_cell 2000, a split_env 2020, and a merge_env 2040 may be modified with these timing assumptions (relative to their above-described counterparts) to create such a 12-transition-per-cycle crossbar. The sv and lv signals represent the input completion of the l and s channels. The rv and mv signals represent the completion of the output data on channel r and the input control data on channel m.

The pulse timing assumption is used in the main data transfer through split_env -> router_cell -> merge_env. This allows the removal of 2 NAND gate completions and the rv bus signal. It also reduces the response time from the L and S arrival to the SE (L and S acknowledge) from 9 transitions to 5. The interference timing assumption is used on the ve bus in the figure; however, at a little extra cost one could produce a signal from the split_env and pass it into the ve bus to remove the interference timing assumption. In the buffers surrounding the split_env and merge_env, the implied-data-neutrality timing assumption is used to satisfy the non-blocking return-to-neutral requirement of the pulse timing assumption, and to keep the critical path of data completion on 2 1of4 codes to 12 transitions per cycle. It should be understood that there are numerous small trade-offs in timing assumptions that can be made in such circuits, all of which are within the scope of this invention.

In addition, while several specific embodiments of the invention havebeen described in the context of asynchronous circuit design, it ispossible to map the event driven architecture of the crossbars describedherein into a synchronous environment with the introduction of a clocksignal and still remain within the scope of the invention. According toone such embodiment, a crossbar circuit architecture similar to thatdescribed above is implemented with the underlying channel model of asynchronous request-grant FIFO rather than an asynchronous four-phasechannel. Since the crossbar is still based on the four independent FIFOsL, S, M, and R, all of the properties that come from implementing thecrossbar with independent flow-controlled FIFO channels still apply. Thedifference is that data transactions begin aligned to a clock-edgeboundary. Such an approach may be desirable, for example, in a singleclock domain synchronous system because it relieves the requirement ofgoing through synchronous to asynchronous conversion and back again.

FIG. 21 is a simplified block diagram illustrating an exemplaryinterface 2100 for transferring data tokens from an asynchronous domain2102 to a synchronous domain 2104 according to a specific embodiment ofthe invention. According to the embodiment shown, a 32-bit wide datatoken, i.e., L[0 . . . 31], encoded using 1of2 encoding is assumed.However, it will be understood that data tokens having any number ofbits and encoded in many different ways may be transferred from onedomain to the other according to the described embodiment.

The 32-bit wide datapath includes a multi-stage buffer queue 2106 whichreceives and transfers the data tokens generated in the asynchronousdomain from one stage to the next according to the delay-insensitivehandshake protocol described above. Although buffer 2106 is shown having8 stages, i.e., being capable of accommodating 8 data tokens, it will beunderstood that according to various embodiments, the length of thisbuffer may vary. As the transfer of each data token into buffer 2106 isachieved, completion of the transaction for each of the bits is signaledbackwards by the first stage of buffer 2106 in accordance with thehandshake.

The datapath also includes one or more asynchronous-to-synchronous (A2S)datapath transfer units (one for each bit of the data token) representedby DTU block 2108. As will be described, DTU 2108 effects the transferof each data token to synchronous domain 2104 in response to an A2S “go”signal and the clock signal (CLK) associated with synchronous domain2104. The manner in which the A2S “go” signal is generated according toa specific embodiment of the invention is described below.

In response to the indication that each of the bits of the token hasbeen successfully transferred to buffer 2106 (i.e., the completedhandshake), completion block 2110 generates a 1of1 transfer tokenrepresenting the completed transfer. According to a specific embodiment,completion block 2110 employs a pipelined architecture to minimize theimpact of the latency inherent in generating a single transfer tokenfrom the completion signals for each of the bits of the data token. Aspecific implementation of such a completion block is described below.

The transfer token generated by completion block 2110 is received bycontrol block 2112 which, in turn, generates a request signal to thesynchronous domain indicating that valid data are available to betransferred. Upon receiving a grant signal from the synchronous domainand in response to a transition of the clock signal, control block 2112generates the A2S “go” signal which causes DTU block 2108 tosimultaneously latch all of the bits of the data token currently at theend of buffer 2106 to the synchronous domain. According to analternative embodiment in which the synchronous domain is always readyfor data, the grant and request signals may be omitted, the A2S “go”signal being generated in response to the transfer token and the clocksignal.
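The request/grant/“go” sequencing just described may be summarized behaviorally. The following Python sketch is provided for illustration only and is not part of the converter specification; the class name A2SControl and its method names are hypothetical, and the sketch abstracts away the clock waveform, the four-phase handshake details, and metastability.

class A2SControl:
    # Behavioral sketch of the A2S control flow of FIG. 21: completion block
    # 2110 delivers one transfer token per data token entering buffer 2106;
    # the control block raises a request and, on a rising clock edge with the
    # grant asserted, issues "go" so that DTU 2108 latches the token.
    def __init__(self):
        self.pending = 0            # transfer tokens not yet converted

    def on_transfer_token(self):
        self.pending += 1           # a complete data token sits in the buffer

    def rising_clock_edge(self, grant):
        request = self.pending > 0  # valid data available for the synchronous side
        go = request and grant
        if go:
            self.pending -= 1       # one token latched into the synchronous domain
        return request, go

Consistent with the alternative embodiment noted above, a synchronous domain that is always ready corresponds to tying grant high, in which case “go” follows directly from the presence of a transfer token and the clock edge.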

According to a specific embodiment, distribution of the A2S “go” signalamong the individual datapath transfer units in DTU 2108 is accomplishedusing a pipelined tree structure which minimizes the effect of thelatency inherent in such a distribution. According to an alternativeembodiment, the A2S “go” signal is distributed to the individualdatapath transfer units using an electrically continuous conductor,e.g., a single wire.

FIG. 22 is a simplified block diagram illustrating an interface 2200 fortransferring data tokens from a synchronous domain 2202 to anasynchronous domain 2204 according to another specific embodiment of theinvention. As with the embodiment discussed above with reference to FIG.21, an exemplary 32-bit wide data token, i.e., L[0 . . . 31], isassumed. Data tokens generated in the synchronous domain are transferredto the asynchronous domain via a datapath which includes a plurality ofsynchronous-to-asynchronous (S2A) datapath transfer units (shown as DTU2206) and a multi-stage buffer queue 2208.

Buffer 2208 receives and transfers the data tokens received from DTU2206 from one stage to the next according to the delay-insensitivehandshake protocol described above. And although buffer 2208 is shownhaving 8 stages, i.e., being capable of accommodating 8 data tokens, itwill be understood that according to various embodiments, the length ofthis buffer may vary. Data tokens generated in the synchronous domainare transferred into buffer 2208 by DTU 2206 in response to an S2A “go”signal generated by control block 2210. Generation of this S2A “go”signal is described below.

In response to the indication that each of the bits of the data token atthe end of buffer 2208 has been successfully transferred out of buffer2208, completion block 2212 generates a 1of1 transfer token representingthe completed transfer and the fact that room is now available in buffer2208 for at least one additional data token. According to a specificembodiment, completion block 2212 employs a pipelined architecture tominimize the impact of the latency inherent in generating a singletransfer token from the completion signals for each of the bits of thedata token. A specific implementation of such a completion block isdescribed below.

The transfer token generated by completion block 2212 is received andtransferred through the stages of transfer token buffer 2214 (which canaccommodate multiple tokens) according to the delay-insensitivehandshake protocol. The number of tokens in token buffer 2214 at anygiven time corresponds to the number of available spaces in buffer 2208for additional data tokens to be transferred from the synchronousdomain. The length of token buffer 2214 may vary according to differentimplementations, different buffer lengths being more suitable forparticular datapath widths.

When control block 2210 receives a transfer token from buffer 2214 andthere is an outstanding request from the synchronous domain for transferof a data token, control block 2210 generates a grant signal indicatingthat the asynchronous domain is ready to receive the data token. Controlblock 2210 also generates the S2A “go” signal which enables the transferof the data token by DTU 2206 to the first stage of buffer 2208.According to a specific embodiment, the S2A “go” signal is distributedamong the individual datapath transfer units of DTU 2206 using apipelined tree structure which minimizes the effect of the latencyinherent in such a distribution. According to an alternative embodiment,the S2A “go” signal is distributed to the individual datapath transferunits using an electrically continuous conductor, e.g., a single wire.
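The S2A direction can be viewed as credit-based flow control, with transfer token buffer 2214 holding one credit for each free stage of buffer 2208. The following Python sketch is illustrative only (the class name S2AControl is hypothetical) and omits the clock waveform, the pipelined completion, and the “go” distribution tree.

class S2AControl:
    # Behavioral sketch of FIG. 22: a credit (transfer token) is returned by
    # completion block 2212 whenever a data token leaves buffer 2208, and a
    # synchronous transfer is granted only when a credit is available.
    def __init__(self, depth=8):
        self.credits = depth        # buffer 2208 starts empty: all stages free

    def on_token_drained(self):
        self.credits += 1           # completion block 2212 -> token buffer 2214

    def rising_clock_edge(self, sync_request):
        grant = self.credits > 0
        go = grant and sync_request
        if go:
            self.credits -= 1       # one stage of buffer 2208 now holds the token
        return grant, go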

According to various embodiments, and as will be understood withreference to FIGS. 21 and 22 and the corresponding discussion, thepipelining of the various elements which generate and distribute the“go” signals results in a low latency solution by which large datatokens may be transferred between asynchronous and synchronous domains.According to some embodiments, the latency for large datapaths, e.g., 32or 64-bit, can be as little as one clock period.

For certain types of synchronous systems in which data transfers mustoccur in blocks of consecutive data and/or which are not tolerant ofwait states, the foregoing A2S and S2A interfaces may not be sufficientby themselves to effectively transfer data between domains. Therefore,according to various specific embodiments of the invention referred toherein as “burst mode” interfaces, solutions are provided which ensurethat the data transmission requirements of the synchronous domain aresatisfied.

FIG. 23 is a simplified diagram illustrating an exemplary “burst mode” interface 2300 for transferring data tokens from an asynchronous domain 2302 to a synchronous domain 2304 according to a specific embodiment of the invention in which the synchronous domain expects data to be transmitted in uninterrupted blocks or “bursts” of consecutive tokens. It should be noted that although the term asynchronous may be used with respect to certain circuitry, the nature of the interfaces of the present invention means that timing constraints exist on the asynchronous side, e.g., the buffer must be fast enough to feed one data token per clock cycle. While this is a fairly easy constraint to meet, in that such a buffer feeds tokens through significantly faster than the typical clock cycle, it is a constraint nevertheless.

According to a more specific embodiment, synchronous domain 2304 is asynchronous memory architecture and interface 2300 is a “write”interface. It should be understood, however, that a burst mode interfacedesigned according to the invention is more generally applicable thanthe specific implementation shown in FIG. 23. That is, variousimplementation details shown in FIG. 23 may not be necessary or may bereplaced with other details for burst mode interfaces designed for otherapplications.

According to the embodiment shown, a 32-bit wide data token, i.e., L[0 .. . 31], encoded using 1of2 encoding is assumed. However, it will beunderstood that data tokens having any number of bits and encoded inmany different ways may be transferred from one domain to the otheraccording to the described embodiment. Control information associatedwith the data token, e.g., a write command bit and the address to whichthe data are to be written, is split off from the data token andtransmitted via control path 2303. The 32-bit data tokens aretransmitted via data path 2305.

As will be understood, the nature of the control information will dependupon the type of memory architecture in the synchronous domain. As willalso be understood, the data tokens may include dummy tokens where onlyspecific words in a block of memory are to be written. These dummytokens may be included in the bursts and may be identified, for example,by a mask bit associated with each of the tokens.

The 32-bit wide datapath includes a multi-stage buffer queue 2306 whichreceives and transfers the data tokens generated in the asynchronousdomain from one stage to the next according to the delay-insensitivehandshake protocol described above. Although buffer 2306 is shown having24 stages, i.e., being capable of accommodating 24 data tokens, it willbe understood that according to various embodiments, the length of thisbuffer may vary. As the transfer of each data token into buffer 2306 isachieved, completion of the transaction for each of the bits is signaledbackwards by the first stage of buffer 2306 in accordance with thehandshake.

The datapath also includes a plurality of asynchronous-to-synchronous(A2S) datapath transfer units (one for each bit of the data token)represented by DTU block 2308. As will be described, DTU 2308 effectsthe transfer of each data token to synchronous domain 2304 in responseto an A2S “go” signal and the clock signal (CLK) associated withsynchronous domain 2304. The manner in which the A2S “go” signal isgenerated according to a specific embodiment of the invention isdescribed below.

In response to the indication that each of the bits of a token has beensuccessfully transferred to buffer 2306 (i.e., the completed handshakefor each bit), completion block 2310 generates a 1of1 transfer tokenrepresenting the completed transfer. According to a specific embodiment,completion block 2310 employs a pipelined architecture to minimize theimpact of the latency inherent in generating a single transfer tokenfrom the completion signals for each of the bits of the data token. Aspecific implementation of such a completion block is described below.

According to a specific embodiment, buffer 2306 is implemented as aseries of asynchronous stages each of which receives and transmits one32-bit data token at a time via intervening buffer channels using thefour-phase asynchronous handshake described above. According to an evenmore specific embodiment, each buffer stage comprises 16 buffer elementsin parallel, each of which is responsible for receiving and transmittingtwo bits of the data using the handshake. As will be appreciated, thereare a number of ways in which buffer 2306 and its buffer stages may beimplemented without departing from the scope of the invention.

A transfer token is generated for every data token which is successfullytransferred to the buffer for the purpose of tracking whether there area sufficient number of tokens in the buffer for sending a burst.According to a specific embodiment, completion block 2310 employs apipelined architecture to minimize the impact of the latency inherent ingenerating a single transfer token from the completion signals for eachof the bits of the data token. More specifically, completion block 2310is implemented as a tree structure which generates the transfer tokenfrom a copy of the data token sent to buffer 2306. An example of such atree structure, including the circuit to copy the data token, is shownin FIG. 24.

Each buffer element 2402 receives and transmits two bits of data using an asynchronous handshake protocol. Each buffer element also generates a completion signal, e.g., a copy of the enable, when a successful transfer has occurred. This completion signal (along with three other completion signals for adjacent buffer elements) is received by a 4-way token collection circuit 2404 which generates a single token when all four completion signals are received. This token (along with three others generated by similar circuits 2404) is transmitted to a final 4-way token collection circuit 2406 which generates the transfer token in much the same way. The CSP for an exemplary 4-way token collection circuit which may be used in such an implementation is given by *[ < ∥ i : 0 . . . 3 : L[i] ? > ; R ! ]. The CSP for an exemplary transfer buffer element which may be used in such an implementation is given by *[ L ? x ; R ! x , T ! ].
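The completion tree of FIG. 24 may be modeled functionally as two levels of 4-way collection. The Python sketch below is illustrative only (the function names are hypothetical) and treats tokens as simple counts rather than handshaking events.

def collect4(tokens):
    # 4-way token collection, per the CSP *[ < || i : 0..3 : L[i]? > ; R! ]:
    # one output token is produced only after a token has arrived on each
    # of the four inputs.
    assert len(tokens) == 4 and all(tokens)
    return 1

def transfer_token(buffer_element_done):
    # FIG. 24 for a 32-bit token carried by 16 two-bit buffer elements:
    # 16 completion signals -> 4 intermediate tokens -> 1 transfer token.
    assert len(buffer_element_done) == 16
    level1 = [collect4(buffer_element_done[i:i + 4]) for i in range(0, 16, 4)]
    return collect4(level1)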

The transfer token is received by accumulator block 2312 which generatesa single synchronization token when a specific number of transfer tokenshave been accumulated indicating the presence of at least one burst ofdata in the buffer; e.g., if each data token is a single word of dataand a burst must be 8 words of data, a synchronization token isgenerated for every 8 transfer tokens received.
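Accumulator block 2312 therefore behaves as a simple modulo counter. A minimal Python sketch, assuming an 8-token burst (the class name BurstAccumulator is hypothetical):

class BurstAccumulator:
    # Emit one synchronization token for every burst_size transfer tokens,
    # e.g. one token per 8 buffered words when a burst is 8 words.
    def __init__(self, burst_size=8):
        self.burst_size = burst_size
        self.count = 0

    def on_transfer_token(self):
        self.count += 1
        if self.count == self.burst_size:
            self.count = 0
            return True             # a full burst is now present in buffer 2306
        return False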

Synchronization buffer 2314 is simply a buffer which copies its inputsto its outputs but won't let the control information on control path2303, e.g., the address and write command, through until it receives thesynchronization token from accumulator block 2312 which indicates thatsufficient data are present in buffer 2306 to effect a write to theaddress identified by the control information. The control informationis then transmitted to A2S interface 2316 which may comprise a simplebuffer stage similar to the datapath transfer units of DTU block 2108and 2308 described above. Alternatively, A2S interface 2316 may beimplemented using something more elaborate such as, for example, A2Sinterface 2100 of FIG. 21.

According to a specific embodiment, the synchronization token generatedby accumulator block 2312 is distributed to the individual bufferelements of synchronization buffer 2314 using a pipelined treestructure, a portion of which is shown in FIG. 25. As with the treestructure of FIG. 24 (which essentially works the reverse function),tree structure 2500 minimizes the impact of the latency inherent indistributing copies of a single token to each of the buffer elements.

As shown in FIG. 25, a 4-way token copy circuit 2502 receives the synchronization token and copies the token to each of a plurality of subsequent token copy circuits 2504 (which may have different numbers of outputs, e.g., 2-way, 3-way) until there are a sufficient number of copies to distribute to the individual buffer elements 2506 of synchronization buffer 2314. The CSP for an exemplary 4-way token copy circuit which may be used in such an implementation is given by *[ L ? ; < ∥ i : 0 . . . 3 : R[i] ! > ]. The CSP for an exemplary synchronization buffer element which may be used in such an implementation is given by *[ L ? x , T ? ; R ! x ].
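Functionally, the distribution tree of FIG. 25 is the inverse of the collection tree of FIG. 24. The Python sketch below (illustrative only; the function names are hypothetical) shows a two-level, sixteen-leaf fan-out of a single synchronization token.

def copy4(token=1):
    # 4-way token copy, per the CSP *[ L? ; < || i : 0..3 : R[i]! > ]:
    # one input token is duplicated onto four outputs.
    return [token] * 4

def distribute_sync_token():
    # One synchronization token fanned out to 16 synchronization buffer
    # elements through two levels of copy circuits (1 -> 4 -> 16).
    level1 = copy4()
    return [leaf for branch in level1 for leaf in copy4(branch)]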

In any case, once the control information, e.g., a write request, hasbeen transmitted to the synchronous domain, the A2S “go” signal isasserted by synchronous control circuitry 2318 and, in response to thesuccessive clock signals, DTU block 2308 transfers a burst of datatokens to be written to the specified memory locations according to theprotocol by which the synchronous memory architecture is characterized.When the burst is complete, the “go” signal is deasserted.

FIG. 26 is a simplified diagram illustrating an exemplary “burst mode”interface 2600 for transferring data tokens from a synchronous domain2602 to an asynchronous domain 2604 according to a specific embodimentof the invention. In the embodiment shown, synchronous domain 2602comprises a synchronous memory architecture, and interface 2600 is theread interface for use with the write interface of FIG. 23. According tovarious other embodiments, S2A interfaces similar to interface 2600 maybe employed in any of a wide variety of contexts in which thesynchronous domain is required to transfer data in bursts of consecutivetokens.

As with write interface 2300 of FIG. 23, a 32-bit wide data path, i.e., L[0 . . . 31], encoded using 1of2 encoding is assumed. However, it will be understood that data tokens having any number of bits and encoded in many different ways may be transferred from one domain to the other according to the described embodiment. The datapath includes a plurality of synchronous-to-asynchronous (S2A) datapath transfer units (one for each bit of the data token) represented by DTU block 2606. As will be described, DTU 2606 effects the transfer of each data token to asynchronous domain 2604 in response to an S2A “go” signal and the clock signal (CLK) associated with synchronous domain 2602. The manner in which the S2A “go” signal is generated according to a specific embodiment of the invention is described below.

The 32-bit wide datapath also includes a multi-stage buffer queue 2608which receives and transfers the data tokens from one stage to the nextaccording to the delay-insensitive handshake protocol described above.Buffer 2608 is shown having 24 stages because in a particularembodiment, this provides space for three 8-token bursts of data.However, it will be understood that according to various embodiments,the length of this buffer may vary. As the transfer of each data tokenout of buffer 2608 is achieved, completion of the transaction for eachof the bits is signaled backwards in accordance with the handshake.

As with write interface 2300, control information, e.g., a read command and address range, generated in asynchronous domain 2604 is not transmitted into synchronous domain 2602 until there is sufficient room in buffer 2608 to accept the expected burst of consecutive data tokens. According to one embodiment, the size of the bursts is constant. According to another embodiment, the size of the bursts varies and may be determined with reference to the control information. In any case, interface 2600 is configured to ensure that, whatever the size of the data transfer from the synchronous domain, there is sufficient buffer space to accommodate it.

According to a specific embodiment, this is achieved by keeping track of the number of tokens transferred out of buffer 2608 with completion block 2610 which generates a transfer token for every data token which is successfully transferred out of buffer 2608. According to a specific embodiment, completion block 2610 employs a pipelined architecture to minimize the impact of the latency inherent in generating a single transfer token from the completion signals for each of the bits of the data token. More specifically, completion block 2610 may be implemented as a tree structure which generates the transfer token from the completion signals generated by the asynchronous circuitry subsequent to the final stage of buffer 2608. Alternatively, completion block 2610 may comprise its own buffer stage following buffer 2608. An example of such a tree structure is described above with reference to FIG. 24.

The transfer token generated by completion block 2610 is received byaccumulator block 2612 which generates a single synchronization tokenwhen a specific number of transfer tokens have been accumulatedindicating there is space in buffer 2608 for at least one burst of data;e.g., if each data token is a single word of data and a burst is 8 wordsof data, a synchronization token is generated for every 8 transfertokens received. The synchronization tokens generated by accumulatorblock 2612 are stored in a token buffer 2614 for application tosynchronization buffer 2616.

Token buffer 2614 is shown as being able to accommodate 3 synchronization tokens at a time. This corresponds to the number of data bursts which may be accommodated by buffer 2608. However, it will be understood that token buffer 2614 may vary in length along with buffer 2608 without departing from the scope of the invention. It will also be understood that when the interface is powered up, token buffer 2614 is fully populated with synchronization tokens to reflect the fact that buffer 2608 is completely empty.

Synchronization buffer 2616 is simply a buffer which copies its inputs to its outputs but won't let the control information on control path 2605, e.g., the address range and read command, through until it receives the synchronization token from token buffer 2614 which indicates that sufficient space exists in buffer 2608 to effect a read of data from the address range identified by the control information. The control information is then transmitted to A2S interface 2618 which may comprise a simple buffer stage similar to the datapath transfer units of DTU blocks 2108 and 2308 described above. Alternatively, A2S interface 2618 may be implemented using something more elaborate such as, for example, A2S interface 2100 of FIG. 21.

As discussed above with reference to interface 2300, there are sometiming constraints in the circuitry of interface 2600. That is, forexample, interface 2600 is configured such that each timesynchronization buffer 2616 receives a synchronization token from tokenbuffer 2614 any data tokens in buffer 2608 have migrated far enoughtoward the end of the buffer such that there is sufficient space at thebeginning of the buffer to accommodate the burst of data precipitated bytransmission of the synchronization token. According to a specificembodiment, this may be achieved, at least in part, because of the speedwith which buffer 2608 transfers tokens from stage to stage.

According to a specific embodiment, each synchronization tokentransmitted from token buffer 2614 is distributed to the individualbuffer elements of synchronization buffer 2616 using a pipelined treestructure as discussed above with reference to FIG. 25.

In any case, once the control information, e.g., a read request, hasbeen transmitted to the synchronous domain, the A2S “go” signal isasserted by synchronous control circuitry 2620 and, in response to thesuccessive clock signals, DTU block 2606 transfers a burst of datatokens from synchronous domain 2602 to buffer 2608. When the burst iscomplete, the “go” signal is deasserted. Generation of such a “go”signal will be described below with reference to more specificembodiments.

More specific implementations of A2S and S2A interfaces will now bedescribed with reference to FIG. 27 et seq. In the subsequentdescription, an asynchronous channel refers to a 1ofN channel plus ahandshaking “enable” wire. The enable wire is identified by an “e”superscript. Communication on these wires happens according to theasynchronous four-phase handshake protocol discussed above. “Validity”refers to the state of the 1ofN channel. When one rail is high, thechannel is said to be “valid”. Otherwise, it is said to be “neutral” orinvalid. A “token” is an abstraction referring to the propagation ofvalid states from one asynchronous channel to the next in a system.

The converter designs described below also make use of a pair of synchronous handshaking signals (referred to as S^(o) and S^(i)) to implement flow control. According to a specific embodiment illustrated in FIG. 27, the handshake protocol used is the following: On a rising clock edge, if both A and B are high, the receiver reads the data. If A is high and B is low, the data channel contains an unread value, and the sender is waiting for the receiver to raise B. If A is low and B is high, the data channel is “empty”; the receiver has read any previous value and is ready for the next one. If A and B are both low, the channel is empty and the receiver is not ready to read from the channel.
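In other words, A behaves as a validity flag driven by the sender and B as a readiness flag driven by the receiver, both sampled on the rising clock edge. The following Python sketch simply enumerates the four cases described above; the function name and the valid/ready terminology are assumptions introduced here for illustration, not taken from the specification.

def sync_channel_state(valid, ready):
    # Decode the synchronous handshake of FIG. 27 at a rising clock edge,
    # with valid playing the role of A and ready the role of B.
    if valid and ready:
        return "transfer: the receiver reads the data this cycle"
    if valid and not ready:
        return "stalled: an unread value is held until the receiver raises ready"
    if not valid and ready:
        return "empty: the receiver has read any previous value and awaits the next"
    return "empty: the receiver is not ready to read"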

The following abbreviations and notation are used to represent varioussignals, channels, and constants: CLK—Clock; Tclk—Clock period;S^(o)—synchronous handshake output signal; S^(i)—synchronous handshakeinput signal; A_(c)—PC 1of1 output channel; go—Control signal to the DTUarray indicating whether to transfer a token (either a synchronoussingle-rail broadcast or a 1of1 four-phase asynchronous channel); anden—Internal enable signal in a cell (sometimes en is also the enable toits input channels, sometimes not).

Each of the embodiments described below implements high-performance conversion circuitry between clocked (synchronous) logic and locally-handshaking (asynchronous) logic. In the asynchronous domain, the transfer of data occurs on 1ofN rail channels, following a four-phase local handshaking protocol. In the synchronous domain, transfer of data happens according to timing relationships with the transitions of a clock signal. Any circuit which mixes the two communication conventions inevitably introduces metastability to the system. Localizing that metastability to a single signal per data token transfer while maintaining low-latency, high-throughput transfers is an objective of various embodiments described hereinafter.

The port interfaces of the Asynchronous-to-Synchronous (A2S) andSynchronous-to-Asynchronous (S2A) converters 2802 and 2804,respectively, are illustrated in FIG. 28. It should be noted that in thefollowing discussion all synchronous signals are assumed to besingle-rail. However, embodiments of the invention can triviallyaccommodate other synchronous signaling conventions (e.g. dual-rail ordifferential).

A simplified description of the behavior of A2S interface 2802 is asfollows:

1. An asynchronous token arrives on the L channel, indicated by all L₀ .. . L_(M-1) channels going valid.

2. On the next rising edge of CLK, if either S^(i) is high or if S^(o)is low, a transfer occurs (go to state 4). Otherwise,

3. The converter waits until a rising CLK transition when S^(i) is high.

4. The data value on L is read (enables go low, the L₀ . . . L_(M-1)data rails go neutral). On the falling edge of CLK, the value isasserted on R₀ . . . R_(N-1) and S^(o) is set high.

5. Operation returns to state 1. Until the next token arrives, on eachrising CLK edge, if S^(i) is high, S^(o) is set low on the subsequentfalling CLK transition.

This is a simplified description due to nonzero slack on the L channelinternal to A2S converter 2802. The precise timing relationship betweenthe L handshake and the converter's synchronized transfer is unknown(but can only happen at times earlier than those indicated above).
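The five steps above may be condensed into a per-clock-edge decision. The Python sketch below is a simplified behavioral restatement only; it ignores the internal slack on L and the falling-edge timing of S^(o), and the function name is hypothetical.

def a2s_rising_edge(token_waiting, s_i, s_o):
    # Returns (transfer, next_s_o): whether the waiting asynchronous token is
    # converted on this clock edge, and the value driven onto S^(o) at the
    # following falling edge of CLK.
    if token_waiting and (s_i or not s_o):
        return True, True           # steps 2 and 4: read L, assert R, raise S^(o)
    if (not token_waiting) and s_i:
        return False, False         # step 5: previous output consumed, S^(o) lowered
    return False, s_o               # step 3: hold state until S^(i) goes high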

A similarly simplified description of the behavior of S2A interface 2804is as follows:

1. The R₀ . . . R_(M-1) channels all go neutral, and the converter waitsfor all R_(i) ^(e) enables to be high (indicating readiness to receive atoken). As long as at least one R_(i) ^(e) is low, S^(o) is set low onthe falling edge of CLK.

2. On the next rising edge of CLK, if S^(i) is high, a transfer occurs(go to state 4). Whether or not a transfer occurs, S^(o) is assertedhigh on the next falling CLK edge.

3. The converter waits until a rising CLK transition when S^(i) is high.

4. The data value on L₀ . . . L_(N-1) is written to the R channels (R₀ .. . R_(M-1) go valid, the enables transition low). Operation returns tostate 1.

The A2S interface and S2A interface designs described below implement the above-described behavior. In addition, specific implementations of the described embodiments are characterized by the following properties. With regard to timing, various designs of the present invention impose a minimum of timing assumptions on all signals. Races exist only against the clock, and on synchronous inputs which are assumed to conform to specified setup and hold times relative to the rising edge of CLK. Assuming all timing assumptions hold, metastability arises only at a single point in the design. This metastability is resolved by a Seitz arbiter. ½ Tclk (minus epsilon) is allowed for metastability resolution. All synchronous outputs transition during some range [tO_(min), tO_(max)] following CLK+.

According to various embodiments, both S2A and A2S directions cansustain one transfer per clock cycle. The maximum latency penalty of theconversion is one clock cycle (relative to a synchronous-to-synchronoustransfer), suffered only in pathological cases. Completion of incomingA2S and outgoing S2A tokens is pipelined (with local DI handshakes) tokeep cycle times low.

According to various embodiments, minimized synchronization to CLKallows “overclocking”: correctness is maintained even as Tclk dropsbelow its minimal value (“nop” cycles are introduced via synchronoushandshaking). Assuming all timing races are met, the only possibility ofmetastability propagating beyond the arbiter is if the arbiter resolvesduring a period of one transition exactly Tclk/2 following CLK+.

The internal high-level organization of the A2S and S2A converters 2802and 2804 according to a specific embodiment is shown in FIG. 29. Eachinterface includes four high-level components:

1. Pipelined Completion (PC) 2902. The purpose of this component is toidentify and acknowledge an incoming (A2S) or outgoing (S2A) data token.This “completion” logic involves feeding the OR'd data rails of eachdata channel into a tree of C-elements, i.e., condensing these datarails into a single “data valid” signal. For all but single-channeltokens, this combinational logic tree introduces too muchforward-latency to sustain a high cycle rate. Therefore, according to aspecific embodiment, the incoming token is completed in a pipelinedmanner, buffering intermediate completion signals at each stage.

According to a specific embodiment, PC 2902 is identical for both A2Sand S2A converters of the same token size & type. It appears on theasynchronous side of each (i.e. at the input of the A2S, at the outputof the S2A).

2. Control Processes (CTRL) 2904 and 2906 (e.g., see FIG. 30). CTRLprocesses 2904 and 2906 are responsible for (1) issuing a “go” signal tothe datapath when both asynchronous and synchronous sides are ready fora transfer, (2) sequencing the asynchronous and synchronous handshakingsignals (A_(c) ^(d), A_(c) ^(e)) and (S^(i), S^(o)), and (3)synchronizing as necessary to CLK.

The control processes for the A2S and S2A designs (CTRL 2904 and 2906,respectively) are nearly identical. The only difference between A2S CTRL2904 and S2A CTRL 2906 is their reset state: A2S CTRL 2904's S^(o)signal resets low, while S2A CTRL 2906's S^(o) resets high. (The formerreflects the empty state of the synchronous output channel, the latterreflects the empty state of the S2A's asynchronous capture buffer.)

3. Datapath Transfer Units (DTU) 2908 and 2910 (e.g., see FIG. 31).Generally, the DTU unit is responsible for transferring a data tokenacross the synchronous/asynchronous boundary once a transfer (“go”)signal is received from the associated CTRL process. The A2S and S2Adatapath transfer units differ significantly. The details of each aredescribed below.

4. Datapath buffering 2912 and 2914. Both the A2S interface and the S2Ainterface require additional stages of asynchronous buffers betweentheir PC and datapath transfer units. The buffers either store datatokens prior to transfer (A2S buffer 2912) or prior to being consumed bysubsequent asynchronous circuitry (S2A buffer 2914). In both cases,timing assumptions are imposed on these buffer stages. Specifically, thebuffers are capable of passing tokens faster than the DTU units canconsume or produce them. Stated another way, the buffer array has nocritical cycles longer than the clock period.

Given the above high-level decomposition of A2S interface 2802, a moredetailed description of its operation can now be provided. Beginningfrom the asynchronous L input, a token (comprising N 1ofM channelsfollowing the four-phase handshake protocol) enters A2S converter 2802and is immediately copied to two branches: one into Pipelined Completion(PC 2902), and the other into datapath buffers 2912 preceding the A2SDTU array. PC 2902 condenses the token into a single 1of1 token throughseveral stages of logic, the number of stages depending on the size of Nand M. The 1of1 token (on the “A_(c)” channel in FIG. 29) is thenpresented to A2S CTRL process 2904 as a notification that anasynchronous token has arrived and is ready to be converted.

A2S CTRL process 2904 samples the state of the 1of1 A_(c) channel on thenext rising edge of CLK. Seeing that it contains valid data (A_(c) ^(d)asserted), it makes the decision whether to transfer the token to thesynchronous domain or not, depending on the states of the output channeland the synchronous “grant” (R^(e)) signal. If the R channel is empty(R^(v) low) or if the grant signal is high, A2S CTRL process 2904 willraise its request signal (R^(v)). If R^(e) is also high, CTRL 2904 willassert the “go” datapath signal to the DTU array indicating that thedatapath transfer units should acknowledge the asynchronous data tokenand latch the value to the synchronous R^(d) bits. By this time, theasynchronous token will have propagated through buffer 2912 and will beready for consumption by the array of DTUs 2908.

If, on the other hand, A2S CTRL process 2904 does not transfer thetoken, i.e., if R^(v) was high and R^(e) was low, then it will neitheracknowledge the A_(c) token nor assert “go”. On some subsequent clockcycle when R^(e) goes high (indicating the recipient has accepted thestale synchronous value on R), it will then transfer the asynchronoustoken as described above.

According to various embodiments, this operational description of A2Sconverter 2802 relies on several timing assumptions:

1. In order to maintain full-throughput transfers (i.e. one every clockperiod when neither side stalls), each asynchronous pipeline unit mustbe capable of completing its input and output handshake cycles in underone clock period. For example, it is the inability of a single-stage PCto complete a 32-bit datapath in a sufficiently short time whichnecessitates the pipelining of this unit.

Note that, in particular, the two branches on the input L path must satisfy this requirement when the pipelines are both at peak dynamic capacity (steady-state condition) and at peak static capacity (following a synchronous-side stall). The latter condition is more difficult to satisfy, but must be satisfied if the converter is to promptly respond to the case that R^(e) goes high after several cycles of stalling low.

Also note that once this condition is satisfied within the A2Sasynchronous circuitry, no further timing assumptions must be imposed onthe asynchronous circuitry feeding into the A2S converter. Outside theA2S, the handshake can stall unpredictably for arbitrarily long times,and the A2S converter will always maintain correctness, convertingtokens at peak throughput whenever possible.

2. The A2S must be able to sample the A_(c) state at the rising edge of CLK and then, if it decides to transfer, it must assert the “go” signal to all A2S_DTU elements, which then must latch the L data value to the R^(d) bits, all within a single clock cycle, never exceeding some maximum output time. Given that the sampling of A_(c) relative to the clock fundamentally requires a nondeterministic period of time to reach a decision (due to metastability resolution), this timing assumption must be verified under the worst-possible arbitration time. If the sampling were ever to take longer than some critical amount (approximately half a clock cycle in this design), then the converter runs the risk of violating its maximum output time (or, more precisely, of propagating a metastable state outside the A2S CTRL arbitration logic). This failure condition must be treated as catastrophic, and the probability of such a failure must be characterized. From this, the MTBF (Mean Time Between Failures) of the A2S converter can be determined, given some assumptions about input/output stall conditions (see the sketch following this list).

3. The A2S converter must never change its synchronous outputs (R^(v),R^(d)) too early following the rising edge of CLK. This is a standardsynchronous timing (“hold time”) constraint. The design presented heresatisfies this by conditioning all output changes on ˜CLK, i.e. as longas the hold times of the output synchronous circuitry are less thanTclk/2, there is no possibility of failure. There is no reason to moreaggressively optimize this minimum output time (in order to give theoutput synchronous circuitry more time for calculation within the clockcycle) since the design assumes a worst-case metastability resolutiontime of approximately Tclk/2. That is, the minimum possible max outputtime is also greater than Tclk/2.
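No closed-form expression for the failure probability mentioned in item 2 is given above; the standard synchronizer MTBF estimate is commonly used for this kind of characterization. The Python sketch below applies that general formula with purely illustrative constants; the parameter values and the function name are assumptions introduced here, not data from the design.

import math

def synchronizer_mtbf(t_resolve, tau, t0, f_clk, f_data):
    # Standard metastability estimate (not taken from the specification):
    # MTBF = exp(t_resolve / tau) / (T0 * f_clk * f_data), where t_resolve is
    # the time allowed for the arbiter to settle (roughly Tclk/2 here), tau and
    # T0 are process-dependent metastability constants, and f_clk and f_data
    # are the clock and token-arrival rates.
    return math.exp(t_resolve / tau) / (t0 * f_clk * f_data)

# Illustrative numbers only: 1 ns resolution window, tau = 30 ps, T0 = 100 ps,
# a 500 MHz clock, and tokens arriving at 300 MHz.
example_mtbf_seconds = synchronizer_mtbf(1e-9, 30e-12, 100e-12, 500e6, 300e6)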

In S2A converter 2804, the arrival of a token to transfer is indicatedby the synchronous-side's assertion of L^(v). S2A CTRL process 2906decides whether to grant a transfer or not by sampling the state of the1of1 A_(c) token at the rising edge of CLK. The presence of a token onA_(c) indicates space in datapath output buffer 2914 for an additionaltoken. In this case (when A_(c) ^(d) is set at the rising edge of CLK),S2A CTRL 2906 will set its L^(e) grant line high and acknowledge theA_(c) token. If both L^(e) and L^(v) go high, the “go” signal to thearray of DTUs 2910 is asserted to transfer the synchronous input valueto the asynchronous capture buffer.

As the output asynchronous circuitry consumes the converted tokens captured in S2A buffer 2914, copies are sent to Pipelined Completion (PC) 2902, becoming new A_(c) tokens. In this manner the total number of A_(c) tokens is conserved in the system, representing the fixed token capacity of S2A converter 2804. If at any point the output asynchronous circuitry stalls (stops draining buffer 2914), buffer 2914 fills up and no new A_(c) tokens are produced. S2A CTRL process 2906 then lowers its grant (L^(e)) line and stops converting tokens until the output logic reads from R, producing an A_(c) token. Pictorial representations of the reset condition, normal operation, and the asynchronous-side stall condition are illustrated in FIGS. 32A-32C, respectively.

S2A converter 2804 must satisfy the same three general categories oftiming requirements described above with reference to A2S converter2802. Namely:

1. All asynchronous pipeline cells within the S2A converter must be ableto sustain clock period handshake cycles under all operating conditions.

In fact, the requirement on the asynchronous output buffer is even morecritical for the S2A converter than it is on the A2S converter's inputbuffer. In the A2S converter, if the input asynchronous buffering“stutters” somewhat when transitioning from a full (previously stalled)to a dynamic condition, at worst an unnecessary send-stall “no-op” cyclewill be introduced. In the S2A converter, however, if the output bufferscannot fully drain a single token in one clock cycle out of a fullreceiver-stall state, the S2A DTU array may not be able to transfer thegranted token when the S2A CTRL process thinks it can. The result wouldbe a lost or corrupted data token.

2. The S2A converter must be able to set its synchronous output signal(L^(e)) within some reasonable max output time in order to satisfy thesetup time of the input synchronous circuitry, even under the worst-casemetastability resolution time. This requirement is also imposed on theinternal go synchronous control broadcast to the S2A datapath; go mustnot transition too late into the clock cycle in order for the datapathunits to be able to transfer (or not) a token on the next clock cycle.

3. All synchronous outputs (L^(e), go) must not transition too early inthe clock cycle. As in the A2S converter, this requirement is satisfiedby conditioning changes on ˜CLK.

Implementation details of specific embodiments of the converter designsare given below. Some details of the circuits have been omitted forclarity. These include staticizers on the output nodes of all dynamiclogic, and extra reset circuitry which any practical implementationwould require. Both of these additions are straightforward to implement.The specifications of the units described below are given in CSP.

Pipelined Completion unit 2902 includes a validity detection element perinput channel. An example of such a circuit is PCS0 3300 of FIG. 33which has a 1of4 input. PCS0 3300 is followed by a log(N)-deep tree ofcells, an example of which is PCS1 cell 3400 of FIG. 34. PCS0 unit 3300implements the simple CSP specification:

*[ L?x; R!x, V! ]

According to a specific embodiment, the “R!x” output operation is donein a “slack-zero” manner, i.e., the L and R data rails are wiredtogether. When one of the L data rails goes high, a 1of1 token is senton V.

A four-input PCS1 unit 3400 implements the CSP specification:

*[ L[0]?, L[1]?, L[2]?, L[3]?; R! ]

i.e., it reads the 1of1 inputs from four PCS0 units, and then outputs asingle 1of1 token. An example with N=4 1of4 input channels (i.e., 8bits' worth of data) is shown in FIG. 35. The PCS1 units can be combinedin a tree structure to complete arbitrarily large datapaths. Largercompletion trees can be constructed in an analogous manner.
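Behaviorally, PCS0 corresponds to OR-style validity detection on a 1ofN channel, and the collection stages rely on the state-holding behavior of C-elements noted earlier. The Python sketch below models these two primitives in an unpipelined form for brevity; the names are hypothetical and the sketch approximates, rather than reproduces, the circuits of FIGS. 33-35.

def channel_valid(rails):
    # Validity of a 1ofN channel: the channel is "valid" when one data rail is
    # high; in hardware this is the OR of the rails (cf. PCS0).
    return any(rails)

class CElement:
    # Muller C-element behavior used by completion trees: the output rises only
    # when all inputs are high and falls only when all inputs are low.
    def __init__(self):
        self.out = 0

    def update(self, *inputs):
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out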

An exemplary CTRL process is shown in FIG. 36. The CSP specification of either of the A2S and S2A control processes is the following:

S^(o) := so := so_init_state;
*[ [ #Ac & CLK -> a := 1 | ˜#Ac & CLK -> a := 0 ],
   [ CLK -> si := S^(i) ];
   [ ˜a & (si | ˜so) -> xso := 0 [ ] else -> xso := 1 ],
   [ a & (si | ˜so) -> Ac? [ ] else -> skip ];
   so := xso;
   [ ˜CLK -> S^(o) := so ]
] || *[ go := S^(i) & S^(o) ]

It should be noted that for the A2S CTRL process, “so_init_state” is 0; for the S2A CTRL process it is 1.

The “S^(o)” output maps to the R^(v) validity signal in the A2Sconverter. In the S2A converter, it maps to the L^(e) enable signal.Likewise, in the A2S converter the “S^(i)” is the input R^(e) and in theS2A converter it is L^(v). The assertion of S^(o) can be considered toindicate the presence of a token in the control process. For the A2Sconverter, it indicates that the converter has asserted a data token tobe consumed by the synchronous circuitry; for the S2A converter, itindicates that the converter is ready to consume a data token.

On each rising clock edge, the control process probes the inputasynchronous channel A_(c) and sets the internal variable “a” high ifthe channel is ready to be read. The process also latches itssynchronous input (S^(i)). If A_(c) has valid data (a), or if thesynchronous side is not ready (S^(i) low), then xso (to become S^(o)) isset high. If A_(c) does not have valid data (˜a) and the synchronousside is ready, then xso is set low. In all other cases, xso (S^(o)) isleft in its prior state.

If A_(c) has valid data and either S^(i) is high or S^(o) is low, theA_(c) token is consumed. This can happen when either S^(o) was asserted(indicating ownership of a token in the CTRL process) and S^(i) was high(indicating the consumption of that token on the clock cycle inquestion), or when S^(o) was not asserted (indicating that the CTRLprocess can accept an A_(c) token regardless of the synchronous side'sstate.) In this case of the logic, the process lowers the A_(c) ^(e)signal, it waits for A_(c) ^(d) to be de-asserted, and then itre-asserts A_(c) ^(e). In the circuit implementation given below, it iscritical that the A_(c) ^(d) go low in response to A_(c) ^(e) within theclock cycle; if it remains high on the next rising edge of CLK, then thecontrol process will duplicate the token. (The Pipelined Completiondesign outlined in the previous section satisfies this requirement.)

On the falling edge of the clock, the “so” internal state variable iswritten to the synchronous handshake output (S^(o)). Once high, S^(o)will stay high until S^(i) goes high.

In parallel to this process, the “go” signal is combinationallygenerated as the conjunction of S^(i) and S^(o). On any rising clockedge with S^(o) and S^(i) both high, the datapath sees an asserted “go”,and a data token passes from one domain to the other.
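Putting the preceding paragraphs together, one clock cycle of the control process can be paraphrased as follows. The Python sketch is a restatement of the CSP given above, not of the circuit implementation, and the function name is hypothetical.

def ctrl_rising_edge(ac_valid, s_i, so):
    # One cycle of the CTRL process; 'so' is the current S^(o) state.
    # Returns (go, consume_ac, next_so).
    go = s_i and so                             # *[ go := S^(i) & S^(o) ]
    consume_ac = ac_valid and (s_i or not so)   # a & (si | ~so) -> Ac?
    if (not ac_valid) and (s_i or not so):
        next_so = False                         # ~a & (si | ~so) -> xso := 0
    else:
        next_so = True                          # else -> xso := 1
    return go, consume_ac, next_so              # next_so drives S^(o) on ~CLK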

As shown in the embodiment of FIG. 36, the circuit implementation of thecontrol process includes five components: internal completion logic 3602responsible for sequencing the enable signal, arbitration logic 3604,S^(i) input latching circuitry 3606, S^(o) output control and statelogic 3608, and the A_(c) acknowledge logic 3610.

Central to the design of the control process is the internal enable signal (“en”), which triggers the set (en high) and reset (en low) phases of the internal dynamic logic. “en” is not strictly synchronized to CLK. It will cycle once per clock cycle, but “en” is sequenced by the validities of the internal signals, not by CLK. A specific circuit implementation of the internal completion logic 3700 is shown in FIG. 37.

A specific implementation of an arbitration logic circuit 3800 for thecontrol process is given in FIG. 38. Arbiter 3802 shown in this circuitcan be any standard mutual exclusion element such as, for example, aSeitz arbiter or a QFR. The “a” variable is implemented as a dual-railsignal pair to allow the use of domino pull-down logic stages elsewherein the circuit. Doing so facilitates the synchronized-QDI (quasi-delayinsensitive) design style used throughout the converters.

For exemplary implementations of Seitz arbiter and QFR circuits, please refer to C. L. Seitz, System Timing, chapter 7, pp. 218-262, Reading, Mass., Addison-Wesley, 1980, and F. U. Rosenberger, C. E. Molnar, T. J. Chaney, and T. P. Fang, Q-modules: Internally clocked delay-insensitive modules, IEEE Trans. Computers, vol. 37, no. 9, pp. 1005-1018, September 1988, respectively. The entire disclosures of each of these references are incorporated herein by reference for all purposes.

The {overscore (kc)} signal in this logic is used to disable thearbitration logic's clock grant signal (A_(g) ⁰) once the A_(c) ^(d)input wins the arbitration (A_(g) ¹ asserted). This must be done toprotect the rest of the circuit from seeing a glitch on A_(g) ⁰ in thecase that A_(c) ^(e) transitions negative while CLK and en are stillhigh.

A more specialized arbitration circuit 3900 which incorporates the{overscore (kc)} function into a variant of the Seitz arbiter is givenin FIG. 39. This design removes the need for an extra dynamic logicstage to generate a. However, elsewhere in the CTRL unit logic stages,wherever “a¹” might have been used with the more general design, theseries combination “A_(g) ¹ & CLK” must be included instead (i.e.,requiring an extra transistor).

The circuits in FIGS. 38 and 39 limit the metastability hazard to thecase that an arbiter output resolves exactly as CLK transitionsnegative. In the case that A_(g) ¹ wins, A_(g) ¹ transitioning high asCLK transitions low can cause an unstable voltage on {overscore (a¹)}(or whatever logic stage depends on A_(g) ¹). In the case that A_(g) ⁰wins, A_(g) ⁰ transitioning high as CLK transitions low can cause A_(g)⁰ to return low before it has completely pulled down {overscore (a⁰)}(or some other logic stage in the more specialized design.) In eithercase, the metastable condition propagates beyond the arbiter. Note thatif the arbiter were to resolve at some time past CLK transitioning low,then the metastable condition does not propagate: if A_(g) ¹ wins atsome point following CLK transitioning low, the transfer is simplydeferred until the next rising clock edge; if A_(g) ⁰ does not win byCLK transitioning low, A_(g) ¹ wins due to the CLK input's withdrawnrequest.

Thus the failure mode due to metastability is dependent on the timerequired for the CLK to transition low. Ensuring a fast slew rate forCLK's negative transition will help protect the circuits from thisfundamental hazard.

According to a specific embodiment, the S^(i) input signal is capturedusing an edge-triggered, self-disabling single-rail to dual-rail circuit4000 shown in FIG. 40. The en signal is used to set and reset the s_(i)^({0,1}) input rails and facilitates the use of asynchronousself-sequencing logic throughout the control process. Furthermore, thedesign relies on this latch's synchronizing relationship to the risingedge of CLK to keep the process from repeatedly cycling in a subtle caseof the logic (when the clock period is significantly slower than theA_(c) cycle time). The protection comes from the circuit's propertythat, once en transitions low and s_(i) resets, the s_(i) ⁰ and s_(i) ¹rails remain low until the next rising edge of CLK.

The s_(i) ^(v) signal encodes the validity of the s_(i) ^({0,1}) rails.It is used in the internal completion logic to allow safe,delay-insensitive sequencing of en.

A specific implementation of a S^(o) synchronous output control circuit4100 is shown in FIG. 41. Since the control process must know the valueof S^(o) from the prior clock cycle, an asynchronous state element(i.e., STATEBIT circuit 4102) is used. A specific implementation ofSTATEBIT circuit 4102 is shown in FIG. 42. The STATEBIT circuit providesautomatic sequencing of the s_(o) ⁰ and s_(o) ¹ signals over the unit'sinternal cycle. An alternative design might use an additional inputlatch of FIG. 40 to resample the state from the synchronous S^(o)signal, but such a design would require additional circuitry to completethe so terms in the sequencing of en.

The cross-coupled NANDs and output latch of FIG. 41 provide a safesynchronization of the asynchronous {overscore (s_(o))} terms, whichonly pulse low during the en rising edge phase of the control process.The cross-coupled NANDs convert the pulse to a persistent value, and theoutput latch restricts the S^(o) output from transitioning while CLK ishigh. Since only one of {overscore (s_(o) ⁰)} or {overscore (s_(o) ¹)}can transition low at a time, and can only transition low while CLK ishigh, S^(o) is set in an entirely synchronous manner.

Like the s_(i) ^(v) signal of the S^(i) input latch, the s_(o) ^(v) signal encodes the validity of the s_(o) state logic. Here, by including the xso and {overscore (xso)} terms in the signal's pull-up logic, the assertion of s_(o) ^(v) additionally implies that the cross-coupled NANDs have successfully captured the appropriate {overscore (s_(o) ^({0,1}))} value.

A specific implementation of an A_(c) Acknowledge logic circuit 4300 isshown in FIG. 43. This circuit is a relatively straightforward dynamiclogic stage, encoding the “a & (si|˜so)->Ac?” expression of the cell'sCSP. When “a” is set (meaning “A_(g) ¹ & CLK”) and s_(i) ¹ or s_(o) ⁰ isasserted, the {overscore (ack¹)} rail is pulled low, causing A_(c) ^(e)to go low, acknowledging the A_(c) input token.

The {overscore (kc)} term is included in the A_(c) ^(e) sequencing toensure that it has disabled the arbiter's clock selection by this time(to avoid the potential glitch on A_(g) ⁰ when A_(c) ^(d) goes low inresponse to the falling edge of A_(c) ^(e)).

The s_(o) ¹ term is redundantly included in the {overscore (ack¹)}pull-down to prevent the repeated cycling scenario described above inthe S^(i) input latch section.

The ack^(v), like the s_(i) ^(v) and s_(o) ^(v) signals, encodes thecompletion state of this block of logic. When {overscore (ack⁰)} isselected, the ack^(v) is delayed until the falling edge of CLK byincluding CLK in the pull-up; when {overscore (ack¹)} is selected,ack^(v) additionally completes the A_(c) ^(e) negative transition.ack^(v) does not return low until A_(c) ^(d) has been withdrawn(completed by the A_(g) ¹ term in the {overscore (ack^(v))} pull-up) andA_(c) ^(e) has returned to its asserted state.

According to a specific embodiment, the A2S and S2A datapath transferunits (e.g., DTUs 2908 and 2910 of FIG. 29) are single-channel converterelements which transfer tokens based on the value of their synchronous“go” input at various phases of the clock period. In order to avoidmetastability hazards within these circuits, timing assumptions must bemade on the asynchronous handshake transitions. For example, when theA2S DTU sees an asserted “go”, it must also receive a token on its Linput during that clock cycle. Likewise, when the S2A DTU receives anasserted “go”, its R^(e) must be high and ready to transition low oncean R data rail is asserted. As discussed above, the high-levelarchitecture of the A2S and S2A converters ensures that theseassumptions are satisfied.

According to a specific embodiment, the A2S datapath transfer units have the following CSP specification:

*[ [CLK];
   [ go -> L? [ ] else -> skip ];
   *[ ˜CLK -> [ go -> R := #L [ ] else -> skip ] ]
]

This process transfers the asynchronous L input to the synchronous Routput on every cycle that “go” is asserted. The unit makes theassumption that go transitions high sometime following the falling edgeof CLK but sufficiently before next rising edge of CLK to satisfy thesetup time constraints of the recipient synchronous logic. When CLKtransitions high on a cycle when go is asserted, L is acknowledged.

A circuit implementation of an exemplary A2S data transfer unit 4400 fora single 1of2 input is shown in FIG. 44. The data bit latch of R istransparent when CLK is low and go is high. When go is low, R is keptlow to protect the output from transitioning unpredictably when Larrives.

In order to keep the circuit from repeatedly acknowledging L tokenswithin a single clock period, the L^(e) negative transition isconditioned on the rising edge of CLK, and the L^(e) positive transitionis conditioned on the falling edge of CLK.

In order to avoid metastability hazards in this unit, the assumption ismade that L^(v) will transition low soon after the falling edge ofL^(e). That is, L must not ever stall in a valid state. This can besatisfied if the A2S input buffer units follow a PCHB or PCFB templateas described in “Synthesis of Asynchronous VLSI Circuits,” by A.J.Martin incorporated herein by reference above.

According to a specific embodiment, the CSP specification of the S2Adatapath transfer unit is

*[[CLK]; [go -> R!L [ ] else -> skip ]; [˜CLK]]

Aside from the handshake with the R output channel, this unit isentirely synchronous in nature; specifically, on each clock cycle, onthe rising edge of CLK, it samples its inputs and evaluates some outputcondition. In this case, it checks if the “go” control signal from theS2A control process is set, and, if so, writes its L bit (or bits) tothe R output channel in a 1ofN rail encoding following the four-phasehandshake protocol. FIG. 45 shows an exemplary one-bit circuitimplementation of the S2A DTU 4500. This design can easily be extendedto support a two-bit input, with a 1of4 rail output data encoding.

According to a specific embodiment, the A2S converter requires at least a single stage of buffering on the datapath, following the point at which L is copied to the pipelined completion (PC) circuitry. The need for this is primarily due to performance considerations; i.e., in order to allow the PC to operate in a pipelined manner, it must not be bottlenecked by tokens backing up in the datapath branch. Essentially, the datapath is “slack matched” to the control (and completion) path.

Another reason for buffering the asynchronous data at the input of the DTU array is to ensure that the inputs to the DTU elements have the correct handshake properties. Namely, the A2S DTU described above relies on its input resetting (returning to its neutral state) promptly after the falling edge of L^(e). This can be guaranteed by having a PCHB or PCFB buffer stage directly preceding the DTU array.

According to a specific embodiment, the S2A converter imposes a much stricter requirement for additional buffering. It needs several buffer stages between its datapath output and its output PC, as well as on the A_(c) completion channel output of the PC. The A_(c) channel buffers initialize “filled”, i.e., with a number of tokens corresponding to the amount of slack available in the datapath (minus the one token with which the S2A control process initializes out of reset).

At least two tokens must be present in the S2A datapath-to-completion loop in order to support a transfer on every clock cycle. One token is consumed by the S2A control process and DTU elements during a transfer. Since the asynchronous portion of the loop has non-zero latency, a second token must always be present in that branch in order to pipeline the transfers.

According to specific embodiments, both the datapath and completion branches have sufficient buffering to absorb the two tokens in the loop. If the datapath buffer capacity is insufficient, the S2A DTU output handshake will stall if the S2A's R output stalls, potentially causing metastability hazards in the datapath or lost tokens. If the completion path buffer capacity is insufficient, data tokens will be trapped in the output buffer when the synchronous side stalls. In this case, the S2A converter will not output a received R token until the next token is received by the converter, which may take an arbitrarily long amount of time.

A final performance-related factor influences the loop token (and therefore buffering) requirements of the S2A converter. When the forward latency through the PC becomes too great, additional tokens must be present in the loop to keep the pipeline at peak capacity.

The internal high-level organization of A2S and S2A converters 4602 and 4604 according to an alternate embodiment is shown in FIG. 46. Each interface includes four high-level components:

1. Pipelined Completion Stage (PCS) 4606. This component is identical to the PC unit described earlier, although 4606 is drawn such that it includes the datapath copy circuitry described in PCS0 circuit 3300.

2. Control Processes (CTPs) 4608 and 4610. The CTP is responsible for (1) issuing a “transfer” signal to the datapath when both the asynchronous and synchronous sides are ready for the transfer, (2) sequencing the asynchronous and synchronous handshaking signals (A^(d), A^(e)) and (S^(i), S^(o)), and (3) synchronizing as necessary to CLK. CTP_A2S 4608 and CTP_S2A 4610 share many circuit elements and have the same port interface, but are not identical. Details of each design, highlighting common functionality, are given below.

3. Datapath Transfer Units (DTUs) 4612 and 4614. Generally, the DTU is responsible for transferring a data token across the synchronous/asynchronous boundary once a “transfer” (go) token is received from the CTP. In the DTU_A2S case, the unit latches an asynchronous 1ofN data token to the synchronous side at a time acceptable to the synchronous clocking discipline (specifically, some time after the falling edge of CLK and before the next rising edge). In the DTU_S2A case, the unit samples the synchronous input on the rising edge of CLK and converts the value to an asynchronous one-hot token once the asynchronous channel is ready (enable asserted).

4. Pipelined Datapath Broadcast (PDB). An exemplary implementation of a PDB 4700 is shown in FIG. 47. This unit implements the complementary function of the PCS. That is, it distributes a single “transfer data” (go^(d)) signal to each DTU in the datapath. In this case, the N backward-going enable signals feed into a log(N)-deep C-element tree to generate the final go^(e) signal. Pipelining the completion adds some forward latency to the go^(d) broadcast, but allows the handshake cycle time to stay low.

According to one embodiment, S2A converter 4604 additionally requires extra asynchronous buffering stages 4618 between its datapath output and its output PCS, and on the “A” channel output of the PCS (i.e., buffer 4620). These provide a guarantee that any transfer initiated by an “A” token can be absorbed in the datapath if the environment is not prepared to read.

Exemplary implementation details of converters 4602 and 4604 according to specific embodiments are given below. Some elements of the designs have been omitted for clarity. These include staticizers on the output nodes of all dynamic logic, and extra reset circuitry which any practical implementation would require. Both of these additions are straightforward to implement. Other circuit details (particularly of the control processes) are not covered since there are many different solutions, and all are fairly straightforward applications of the design style described herein.

An exemplary CSP specification of A2S control process 4608 is the following:

*[ [ #A & CLK -> a := 1 | ˜#A & CLK -> a := 0 ], [CLK -> si := S.i];
   [ a | ˜a & ˜si & x -> x′ := 1 [] else -> x′ := 0 ],
   [ a & (si | ˜x) -> go!, A? [] else -> skip ];
   x := x′, [˜CLK -> S.o := x′] ]

On each rising clock edge, the process probes the input asynchronous channel A and sets the internal variable “a” high if the channel is ready to be read. The process also latches its synchronous input (S^(i), which indicates whether the synchronous side is ready to receive data on that clock cycle). If A has valid data, or if the synchronous side is not ready to receive data and the synchronous datapath output holds an unread value (“x” high), then “x′” is set high. The “x′” variable sets the state of the synchronous datapath output channel (“x”) on the next clock cycle. If the asynchronous channel A contains valid data (indicating the presence of an input data token to the datapath), and if either the synchronous side is ready to receive data or the synchronous datapath output channel is empty (“x” low), then A is read and a “transfer” token (go^(d)) is sent to the datapath.

On the falling edge of the clock, the “x′” variable is written to the synchronous handshake output (S^(o)). This signal encodes the state of the datapath output to the synchronous logic: if it is high, a new data value is sitting on the wires. Once high, S^(o) will stay high until S^(i) goes high. On any rising clock edge with S^(o) and S^(i) both high, a data token passes from the A2S to the synchronous-side logic.

According to a specific embodiment, S2A control process 4610 is somewhat simpler since it does not need to store the state of the synchronous datapath channel:

*[ [ #A & CLK -> a := 1 | ˜#A & CLK -> a := 0 ], [CLK -> si := S.i];
   [ a & si -> A? [] else -> skip ], [ x & si -> go! [] else -> skip ];
   x := a, [˜CLK -> S.o := a] ]

In this case, S^(i) is a synchronous request to transfer a data token; S^(o) grants the transfer, indicating to the synchronous side that the output (R) asynchronous channel is empty.

Implicit in the design of these control processes is the internal enable signal (“en”), which triggers the set (en high) and reset (en low) phases of the internal dynamic logic. “en” is not strictly synchronized to CLK. It will cycle once per clock cycle (except in the case that a cycle is missed due to a maximum arbiter resolve time), but “en” is sequenced by the validities of the internal signals, not by CLK (as illustrated in FIG. 48).

Several structural similarities between the two control processes described above are evident from their CSP descriptions. From the first line of each loop, an arbitrated select, it is clear that the same arbitration logic is used in both. A particular implementation of such arbitration logic 4900 is shown in FIG. 49. According to various embodiments, arbiter 4902 shown in this circuit can be any standard mutual exclusion element such as, for example, a Seitz arbiter or a QFR. The “a” variable is implemented as a dual-rail signal pair to allow the use of domino pull-down logic stages elsewhere in the circuit. Doing so facilitates the synchronized, quasi-delay-insensitive design style used throughout the converters.

The circuit in FIG. 49 limits the metastability hazard to the case in which an arbiter output resolves exactly as CLK goes low. In the case that A_(g)¹ wins, the rising edge of A_(g)¹ as CLK goes low can cause an intermediate value on {overscore (a¹)}. In the case that A_(g)⁰ wins, the rising edge of A_(g)⁰ as CLK goes low can cause A_(g)⁰ to return low before it has completely pulled down {overscore (a⁰)}. In either case, the metastable condition propagates beyond the arbiter. Note that if the arbiter were to resolve at some time past CLK going low, the metastable condition does not propagate: if A_(g)¹ wins at some point following the falling edge of CLK, the transfer is simply deferred until the next rising clock edge; if A_(g)⁰ does not win by the falling edge of CLK, A_(g)¹ wins.

According to a specific embodiment, both control processes also share an internal state variable, “x”. The A2S circuit sets this state based on an intermediate variable “x′”, a logical expression of its inputs; the S2A circuit sets it directly from the arbiter component output “{overscore (a)}” (in this case, x′ := {overscore (a)}).

According to a specific embodiment, both control processes use the same state variable to set their synchronous output signal, S^(o). FIG. 50 illustrates an exemplary combined statebit-to-synchronous-latch circuit. The “xv” signal shown in the diagram encodes the validity of the “x” variable (plus the following RS latch), needed for subsequent completion (i.e., “en” control). This combination of a dynamic pull-down stage ({overscore (x)}) followed by an RS flip-flop, followed by a clocked latch, plus the associated {overscore (xv)} validity circuit, provides a convenient asynchronous-to-synchronous circuit fragment when the timing of {overscore (x)} is sufficiently restricted to ensure stability of the output clocked latch. Specifically, {overscore (x)} cannot go valid too close to the falling edge of CLK. This condition is satisfied in the CTP.

A final shared component of the designs according to particular embodiments is the handling of the control processes' synchronous input S^(i). To minimize the hold time requirement on the signal, the edge-triggered, self-disabling single-rail to dual-rail (S2DE) latch 5100 shown in FIG. 51 may be used. The S2DE latch provides a sufficiently safe synchronous-to-asynchronous conversion when it is possible to ensure that the rising edge of en will never coincide with the falling edge of CLK, which could cause a metastable condition on s_(i)^({0,1}). This requirement establishes the following timing constraint on the rising edge of en: given the latest time into a clock cycle that “a” may be set (the event which triggers all other sequencing in the processes), the rising edge of en must occur before the following cycle's CLK negative transition. The case of en going low then high before the falling edge of CLK of the transfer cycle must also be prohibited, but this can easily be ensured by conditioning the falling edge of en on the falling edge of CLK. The latest “a” may be set is at the falling edge of CLK (the maximum arbiter resolution case), so the CTP has a maximum of one clock cycle to complete the en cycle.

A more robust latch design (e.g., latch 5200 of FIG. 52) can be used to eliminate any potential metastability on s_(i)^({0,1}) at the expense of extra transitions on its handshake cycle and an additional arbiter. These extra transitions can be hidden by inserting a buffer stage 5202 (slack ½) between the central control process and the S2DE latch 5100.

The remaining details of these particular implementations of CTP_A2S 4608 and CTP_S2A 4610 can be implemented in a variety of ways according to various embodiments following the general production rule synthesis techniques of the quasi-delay-insensitive design style described in “Synthesis of Asynchronous VLSI Circuits,” by A. J. Martin, incorporated herein by reference above. This flexibility arises from different reshuffling possibilities of the A? and go! handshaking expansions, and from different transition completion strategies. Finally, internal timing races may be introduced to simplify and/or speed up the circuits.

According to a specific embodiment, the A2S datapath transfer units 4612 have the following CSP specification:

*[ L?l, go?; [˜CLK]; R := l ]

L is the asynchronous input channel from the PCS; “go” is the channel from the CTP indicating that a transfer should occur. The DTU_A2S reads from the L and go channels, waits for CLK to be low (note that it may already be low), and then outputs the data value to the synchronous R output. As long as the forward latency of go^(d) through the PDB is minimal, and assuming the PCS is properly slack-matched (as it is in the implementation discussed above), the behavior of the CTP_A2S guarantees that the L and go channels will both go valid during some bounded range surrounding the falling edge of CLK. The upper end of this range, accounting for the additional R := l latency of the DTU_A2S and the setup time on the R output signal, imposes an important lower bound on tau.

A specific circuit structure which implements the above CSP specification is given in FIG. 53. The {overscore (x)}-to-R latch and {overscore (xv)} circuitry is identical to that used for the S^(o) signal in the CTP circuits. The timing constraint on {overscore (x)} (that it not go valid too close to the falling edge of CLK) is satisfied here.

The CSP specification of a particular implementation of S2A datapath transfer unit 4614 is given by:

*[ [˜CLK]; [CLK]; x := L ] ∥ *[ go?; R!x ]

This implementation includes two parallel processes: one which captures the synchronous input L on every rising clock edge (and converts the single-rail data format into a 1ofN rail format), and another which writes the value to the asynchronous output channel (R) once a “go” transfer token is received. In the case that N is 4, the first process can be implemented using the S2Q sampler circuit 5400 shown in FIG. 54. S2Q circuit 5400 captures the values of its two synchronous inputs on every rising edge of CLK, and outputs their combined value on a 1of4 channel, x. x transitions through its all-low state immediately following the rising edge of CLK before asserting the selected data rail. Similar circuits for N other than 4 can be implemented by changing the input combinational logic.
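
A behavioral rendering of the sampler function may make this concrete. In the sketch below, the module and port names are illustrative assumptions, and the brief return-to-neutral transition of x immediately after the clock edge is abstracted away; only the captured one-hot value is modeled.

// Behavioral sketch of the S2Q sampler function (illustrative only;
// the actual unit is the precharged circuit 5400 of FIG. 54, whose x
// output passes through its all-low state right after the CLK edge;
// that return-to-neutral phase is not modeled here).
module s2q_model (
  input  wire       clk,
  input  wire [1:0] l,  // two synchronous data bits
  output reg  [3:0] x   // 1of4 (one-hot) encoded output channel
);
  always @(posedge clk)
    x <= 4'b0001 << l;  // assert exactly one of the four data rails
endmodule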

The second process in this DTU_S2A implementation is the circuit 5500 shown in FIG. 55. It is a WCHB stage (see “Synthesis of Asynchronous VLSI Circuits,” by A. J. Martin, incorporated herein by reference above) modified to accommodate its unstable x input. It treats x as an unacknowledged input, and writes its output R once go and x are valid. The inclusion of R₁^(e) in the pull-down logic (a departure from the WCHB template) provides some protection if R^(e) and R₁^(e) do not transition low before the next validity phase of x (i.e., some time after the next rising edge of CLK), which might otherwise result in the assertion of multiple R rails. Doing so imposes a less rigid synchronization of the transfer cycle to CLK.

The x^(i) rails can be excluded from the {overscore (R^(i))} pull-up networks (another departure from the WCHB template) since the design guarantees that the R₁^(d) low-to-high transition occurs during a range of the clock cycle surrounding the falling edge of CLK, excluding the rising edge of CLK. As long as the minimum time between the rising edge of CLK and the rising edge of R₁^(d) is longer than the maximum x reset time (a timing assumption of the design), the unacknowledged x input poses no threat of destabilizing R.

A specific implementation of the pipelining of the “go” channel broadcast to the datapath is illustrated in FIG. 47. According to a specific embodiment, BUF element 4702 is a 1of1 channel PCHB buffer. For a 16-node broadcast tree, four transitions are added to the rising edge of the go^(d) broadcast. In return, the CTP go^(d) positive transition is acknowledged in three transitions instead of a minimum of seven, and the rising edge of go^(e) can follow the falling edge of go^(d) in a single transition compared to a minimum of seven. Thus the pipelining saves 10 transitions on what would otherwise be the critical cycle of the design.

As mentioned above and according to various embodiments, S2A converter 4604 of FIG. 46 requires extra asynchronous buffering stages 4618 between its datapath output and its output PCS, and on the “A” channel output of the PCS (buffer 4620). According to one such embodiment, the “A” channel buffers must initialize “filled”, i.e., with a number of tokens corresponding to the amount of slack available in the datapath. This slack is defined by the number of data tokens a DTU plus its output buffers plus the PCS can hold before the go^(e) signal would stall, i.e., not transition high following the falling edge of go^(d).

According to a more specific embodiment, at least one unit of slack (two half-buffer stages) is needed between the DTUs and the PCS to ensure that the PCS can never issue an “A” token when its subsequent transfer at the datapath might stall. Specifically, validity on the DTU output channels should not by itself result in an “A” token being generated, since the R^(e)'s of the DTUs (implied directly by the environment) may stall high. If a DTU's R^(e) stalls high, its go^(e) into the PDB stalls low. In this case, the CTP's go^(d) transaction will not complete within a clock cycle, which the CTP_S2A specification above assumes.

According to various embodiments, when the outer-loop forward latency (i.e., rising edge of CLK to rising edge of A^(d)) becomes too great (inevitable with large word sizes), an additional unit of slack can be added to the DTU R channels and the A channel (with another initialization token). Doing so amortizes the outer loop latency over two clock cycles. The benefit of additional slack on these channels diminishes as the backward “hole” propagation latency becomes the critical cycle, incurred when the environment drains the outermost token in a full (previously stalled) state.

According to various embodiments, the A2S and S2A converter architectures described above can be adapted to handle burst-mode transfers. It will be understood that, although one specific category of protocols is used herein to illustrate a specific implementation, various embodiments of the invention support a wide variety of burst transfer protocols. The following definitions are useful for understanding the described burst mode implementations.

Burst Transfer: A conversion (either A2S or S2A) allowing more than one data token transfer per request/grant control handshake. For example, in implementations described above, one data token is transferred per clock cycle with both S^(i) and S^(o) high. By contrast, a burst transfer might transfer two, four, or any number of data tokens per clock cycle with both S^(i) and S^(o) high. According to a particular implementation, a maximum burst size is imposed, a constant referred to below as MAX_LEN.

Message: A sequence of data tokens. According to a specific embodiment, each data token has an associated tail bit which is zero on every data phase except the last token in the sequence. In a particular implementation described below, a message may be of arbitrary length, and the converters segment the message into bursts of lengths up to MAX_LEN. FIG. 56 is a timing diagram which serves to illustrate such an implementation in which a message comprising data tokens D0-D9 is segmented into three bursts (e.g., with MAX_LEN equal to four, bursts D0-D3, D4-D7, and D8-D9).

Pipelined Burst Transfer: A burst transfer protocol which allows the request/grant control phase of a burst transfer to take place during the transfer of a prior burst's message. The number of messages that the receiver will grant in advance of the communication of those messages is referred to herein as the number of grant control tokens in the sender-receiver loop. According to various embodiments, an arbitrary number of such tokens may be supported. A particular implementation imposes some finite maximum number of such outstanding granted bursts, a constant referred to as MAX_GRANT. FIG. 57 includes two timing diagrams, the first illustrating the signal timing for non-pipelined 3-word burst transfers, and the second illustrating signal timing for pipelined 4-word burst transfers.

Because a benefit of burst transfers arises from the receiver being able to commit to a sustained acceptance of data tokens, thereby implying some finite amount of available buffer space, a limit on the length of each message is established (MAX_LEN). According to various embodiments, the message length may be fixed (e.g., as shown in FIG. 57) or, alternatively, messages can be allowed to have a variable length up to MAX_LEN.

A specific embodiment of a burst-mode converter designed according to the invention employs a message tail bit to support variable-length messages. Alternative embodiments employ other mechanisms for encoding variable message lengths (e.g., a burst count sent during the control handshake phase, or included in-band as a header word of the message). Alternative implementations eliminate such mechanisms where message sizes are fixed.

In order to support burst transfers, the A2S design described above with reference to FIGS. 46-55 includes two additional cells. Otherwise, the general architecture is similar to that described above. FIG. 58 is a high level diagram showing such a burst mode A2S converter (BURST_A2S) 5800. The two new cells are burst complete logic (BURST_COMPLETE) 5802 and burst repeat cell (BURST_REPEAT) 5804.

According to one embodiment, burst complete logic 5802 is a simple asynchronous delay-insensitive logic unit with the following CSP specification:

BURST_COMPLETE ==
  i := 0;
  *[ Ac?, T?t;
     [ ˜t -> i := (i+1) % MAX_LEN
     [] t -> i := 0 ];
     [ i == 0 -> Bc! [] else -> skip ] ]

The unit reads an input tail token per pipelined completion token from the datapath, and whenever the tail token is “1,” or when it has received MAX_LEN tokens without a “1” tail token, it sends a 1of1 “Burst Completion” token to its Bc output channel.
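
As a concrete illustration, the following synchronous, token-level Verilog approximation of BURST_COMPLETE treats one Ac/T token pair as arriving on each cycle in which ac_valid is high; the port names and parameterization are assumptions of the sketch, and the actual unit is an asynchronous delay-insensitive circuit rather than clocked logic.

// Token-level approximation of BURST_COMPLETE (a sketch; the real
// unit is asynchronous and delay-insensitive).
module burst_complete_model #(parameter MAX_LEN = 4) (
  input  wire clk, rst,
  input  wire ac_valid,  // an Ac token (and its T tail token) arrives
  input  wire t,         // tail bit accompanying the token
  output reg  bc         // one-cycle pulse modeling a Bc token
);
  reg [7:0] i;           // position within the current burst
  always @(posedge clk) begin
    bc <= 1'b0;
    if (rst) i <= 0;
    else if (ac_valid) begin
      if (t || i == MAX_LEN-1) begin
        i  <= 0;         // burst ends on a tail or at MAX_LEN tokens
        bc <= 1'b1;
      end else
        i <= i + 1;
    end
  end
endmodule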

According to various embodiments, the implementation of the burst complete logic varies in relation to the value of MAX_LEN. For example, for small values of MAX_LEN (e.g., 2 to 4), the cell can be implemented in a single pipeline stage with internal state bits. For larger values, the cell may be decomposed into separate stages for incrementing the “i” internal variable and for comparing and reinitializing “i.” A specific circuit implementation of the burst complete logic is described below for a fixed-size message variation of a burst mode A2S converter.

The burst repeat cell extends the “go” signal pulse to the datapath over several clock cycles corresponding to the length of the burst. According to various embodiments, the burst repeat cell may have the following specification:

BURST_REPEAT ==
  g := 0, bcnt := 0;
  *[ [CLK]; xgo := go, t := T;
     [ xgo & ˜t & (bcnt != MAX_LEN-1) -> g := g+1
     [] ˜xgo & g>0 & (t | (bcnt = MAX_LEN-1)) -> g := g-1
     [] else -> skip ];
     [ ˜t & (xgo | g>0) -> bcnt := (bcnt+1) % MAX_LEN
     [] t -> bcnt := 0
     [] else -> skip ];
     [˜CLK] ]
  ||
  *[ bgo := go | g>0 ]

According to specific embodiments, this cell may be implemented in a straightforward manner by applying standard synchronous design techniques. According to such embodiments, all of the cell's inputs and outputs are synchronous; that is, inputs are sampled on the rising edge of the clock and outputs (including state variables) can be expressed as combinational functions of its inputs (e.g., either registered, as for “g” and “bcnt”, or not, as for “bgo”).
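
Since all of the cell's ports are synchronous, the CSP above maps almost directly onto registered logic. The following Verilog sketch is one such rendering; the counter widths and the MAX_LEN parameter value are illustrative assumptions, and the blocking assignments mirror the CSP's sequential updates.

// Synchronous rendering of the BURST_REPEAT specification (a sketch).
module burst_repeat #(parameter MAX_LEN = 4) (
  input  wire clk, rst,
  input  wire go,    // per-burst grant from the control process
  input  wire t,     // tail bit of the word on the datapath (T)
  output wire bgo    // per-word "go" to the DTU array
);
  reg [7:0] g;       // outstanding granted bursts ("g")
  reg [7:0] bcnt;    // length of the current active burst ("bcnt")
  always @(posedge clk)
    if (rst) begin
      g = 0; bcnt = 0;
    end else begin
      // first selection: a "go" during an active burst is banked; a
      // burst completing without a new "go" consumes a banked grant
      if (go && !t && bcnt != MAX_LEN-1)
        g = g + 1;
      else if (!go && g != 0 && (t || bcnt == MAX_LEN-1))
        g = g - 1;
      // second selection: burst length counter, reset by the tail bit
      if (t)
        bcnt = 0;
      else if (go || g != 0)
        bcnt = (bcnt == MAX_LEN-1) ? 0 : bcnt + 1;
    end
  // the extended per-word grant, as in the parallel process above
  assign bgo = go || (g != 0);
endmodule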

According to one embodiment, the burst repeat cell implements two counters: one tracking the number of outstanding bursts that have been granted by the control process (e.g., “g”, incremented every time “go” is asserted within an active burst), and one tracking the length of the current active burst (e.g., “bcnt”). According to this embodiment, the “g” counter is required to support pipelined burst grants, and the “bcnt” counter is required to support segmentation of messages greater than MAX_LEN into multiple bursts. According to various implementations, the burst repeat cell may be simplified by eliminating either or both of these features.

Another difference between the burst mode A2S converter and the baseline A2S converter described above (e.g., A2S converter 2802 of FIG. 29) is the amount of internal buffering on the LD and LT channels. That is, in a particular implementation of the baseline A2S converter, only a single input buffer (i.e., static slack 1) is required for correctness, although various implementations might use more for performance optimization (e.g., slack-matching to the control-to-datapath branch). By contrast, and according to a specific embodiment of the invention, the burst mode A2S converter requires a single message (e.g., of MAX_LEN words) of buffering for correctness. Since the control path is now used only once per message, slack matching to it is less of a consideration.

According to one embodiment, two requirements are placed on the message buffer:

1. It must be able to receive and source tokens once per clock cycle regardless of how full it is.

2. The forward latency through the buffer must be less than the control path to DTU latency.

In an alternate embodiment which supports pipelined burst transfers, additional messages' worth of buffering are added. In general, in order to support the requesting of N burst transfers in advance, (N+1)*MAX_LEN words of buffer space are provided. For example, supporting two outstanding burst requests with MAX_LEN of four would require twelve words of buffering.

It should be noted that, as buffer slack needs increase, a linear array of PCHB/PCFB/WCHB buffers may become an inappropriate choice due to area implications and to the difficulty of satisfying the above timing constraints. Therefore, various such embodiments may employ a dual-ported FIFO memory implementation of this buffering.

According to a specific embodiment, the burst mode A2S converter of FIG. 58 has two asynchronous input channels: the datapath LD channel (a collection of N 1ofM channels), and an LT tail bit 1of2 channel. Each data token received by the burst mode A2S converter on LD is accompanied by a tail token on LT. For each data token except the last in a message, the LT token is “0.” On the last data token, the value of LT is “1.” The tail bit is also added to the burst mode A2S converter's synchronous output interface as another data bit (denoted “R.t” in the figure). As LD data words enter the burst mode A2S converter, they are copied to two units: message buffer 5806 and pipelined completion (PC) unit 5808. Message buffer 5806 stores the token until the array of data transfer units (DTUs) 5810 is ready to transfer it. PC unit 5808 consumes the LD token and outputs a single 1of1 Ac token to burst complete logic 5802.

The LT tail token associated with the LD data token is also copied to these two units: one copy is buffered with the data, and the other is sent to burst complete logic 5802.

Upon receiving both Ac and LT tokens, burst complete logic 5802 either increments its internal burst counter (if LT is “0” and the counter has not reached MAX_LEN), or else sends a 1of1 token on its Bc output channel to A2S control unit 5812 (if LT is “1” or its counter has reached MAX_LEN).

Control unit 5812 handles the “Bc” token just as it handles the “Ac” token in the non-burst A2S design. Namely, it asserts its “S^(o)” handshake signal synchronously with CLK, waiting for its “S^(i)” input to be asserted. Once both are asserted, it asserts its output “go” signal to the datapath for a single clock cycle. In the burst mode A2S converter design, the assertion of “go” represents the granting of an entire burst (up to MAX_LEN tokens) of data, rather than the single word it granted in the non-burst A2S design.

Burst repeat cell 5804 provides the appropriate translation between the per-burst “go” signal from control unit 5812 and the per-word “bgo” control to the datapath. Simply stated, it extends the single-cycle assertion of “go” over a number of cycles matching the length of the burst. In order to know when to end the burst, burst repeat cell 5804 both watches the output tail bit (i.e., R.t) and maintains an internal counter in case the message is longer than MAX_LEN (i.e., in case it must terminate the burst and continue the message over a subsequent burst when “go” is reasserted). According to an embodiment in which pipelined burst transfers are supported, burst repeat cell 5804 also increments a grant counter whenever it sees an asserted “go” from control unit 5812 while a burst is still in progress.

The modifications to the A2S design to implement the burst protocol described above may be applied in a symmetrical manner to the S2A converter. FIG. 59 is a high level diagram of such a burst mode S2A converter 5900 designed according to a specific embodiment of the present invention. According to a more specific embodiment, burst complete logic 5902 and burst repeat cell 5904 may be implemented as described above.

Burst mode S2A converter 5900 has the same synchronous interface as the baseline S2A converter (e.g., S2A converter 2804 of FIG. 29) with the addition of an L.t tail bit, which can be considered an additional bit of data, its state serving to segment the data sequence on L.d into messages. The asynchronous output interface also remains unchanged except for the addition of the output tail bit, RT (a 1of2 channel). The data output channel “R” in the baseline S2A becomes “RD” in the burst mode S2A converter.

For every burst grant cycle negotiated by L.e and L.v, up to MAX_LEN data tokens are transferred by the burst mode S2A converter's array of DTUs 5906. The extension of the “go” signal of control unit 5908 over multiple clock cycles corresponding to the length of each burst is handled by burst repeat cell 5904 in a manner similar to that described above with reference to burst mode A2S converter 5800. If the L.t bit stays low for MAX_LEN cycles, burst repeat cell 5904 terminates the burst, requiring an additional grant control token to be negotiated (which may have happened concurrently if the implementation supports pipelined grants).

As bursts are collected in data buffer 5910, their words are completed to “Ac” 1of1 tokens, which are then further completed by burst complete logic 5902 to “Bc” 1of1 tokens. The “Bc” tokens are returned to control unit 5908, indicating that a burst's worth of buffer space has drained. According to a specific embodiment, the burst mode S2A converter design initializes with its data buffer completely empty and the “Bc” channel initialized with as many tokens as data buffer 5910 and burst repeat cell 5904 will support. The number of initial “Bc” tokens greater than one corresponds to the number of pipelined grant tokens control unit 5908 will issue. Thus, the internal “g” counter of burst repeat cell 5904 must support counts up to this number.

According to specific embodiments in which all burst messages are of a fixed length MAX_LEN, there is no need to include a tail bit in the design. In such embodiments, the fixed-length burst mode A2S converter always sends MAX_LEN tokens per transfer, and the recipient synchronous logic counts the number of tokens transferred following a cycle with both S^(i) and S^(o) asserted to know when the message ends.

Likewise, the synchronous logic feeding a fixed-length burst mode S2A converter always provides valid data for MAX_LEN cycles beginning with a cycle in which both S^(i) and S^(o) are asserted (or following the end of the prior transfer when the control phase is pipelined). The asynchronous recipient logic then counts the data tokens it receives to know when messages begin and end.
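
For example, the recipient-side counting for a fixed-length burst mode A2S converter might look like the following sketch; the module and signal names, and the assumption that the first word transfers on the grant cycle itself, are illustrative only.

// Recipient-side message framing for a fixed-length burst A2S
// converter (a sketch). A message starts on a cycle with both S^i
// and S^o asserted and then runs for MAX_LEN words; "remaining"
// frames the message for the downstream logic (remaining == 0 means
// no message is open).
module fixed_burst_rx #(parameter MAX_LEN = 4) (
  input  wire clk, rst,
  input  wire s_i, s_o,       // synchronous handshake with the converter
  output reg  [7:0] remaining // words still expected in the open message
);
  always @(posedge clk)
    if (rst)
      remaining <= 0;
    else if (remaining == 0 && s_i && s_o)
      remaining <= MAX_LEN - 1;   // word 0 transfers this cycle
    else if (remaining != 0)
      remaining <= remaining - 1; // words 1 .. MAX_LEN-1
endmodule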

Block diagrams of these simpler burst converter designs are given in FIGS. 60 and 61. According to specific embodiments, fixed burst complete logic 6002 and 6102 are simply token counters which may be implemented as follows:

FIXED_BURST_COMPLETE ==
  i := 0;
  *[ Ac?; i := (i+1) % MAX_LEN;
     [ i == 0 -> Bc! [] else -> skip ] ]

The remainder of the converter blocks operate as described above with reference to FIGS. 58 and 59.

An example implementation of fixed burst complete logic when MAX_LEN equals two is given in FIG. 62 (DECIMATE2_1of1). When MAX_LEN is any power of two (2^(N)), a cascade of N DECIMATE2_1of1 units may be used to implement the fixed burst complete logic. When MAX_LEN is not a power of two, or when the forward latency through cascaded DECIMATE2_1of1 units becomes unacceptably high, a more general counter design may be used.

The fixed-length burst mode A2S and S2A converter designs may use the burst repeat cell described above by simply wiring the “T” tail bit input to logic zero. Alternatively, the unit may be simplified for this application by factoring the tail bit logic out of its implementation.

In certain applications it is desirable to transfer data tokens on both the falling and rising edges of the synchronous clock, i.e., in so-called double data rate (DDR) applications. As long as the application calls for an even number of data transfers per burst beginning on a rising edge of the clock, the only changes necessary to the burst mode A2S and S2A converters described above (e.g., in FIGS. 58-61) are to the respective Datapath Transfer Units.

According to a specific embodiment, the DDR version of the A2S Datapath Transfer Unit can be specified as follows:

A2S_DDR_DTU ==
  CLK0 := 0;
  *[ [CLK != CLK0];
     [ go -> L?R [] else -> skip ],
     CLK0 := CLK ]

The unit waits for a transition on CLK, and when “go” is asserted, it reads from its asynchronous input “L” to its synchronous output “R”.

According to a similar embodiment, the DDR version of the S2A Datapath Transfer Unit has the following specification:

S2A_DDR_DTU ==
  CLK0 := 0;
  *[ [CLK != CLK0];
     [ go -> R!L [] else -> skip ],
     CLK0 := CLK ]

The unit waits for a transition on CLK, and when “go” is asserted, it sends its synchronous input “L” to its asynchronous output channel “R”.

Circuit implementations of these DDR DTU variants are given in FIGS. 63 and 64. Note that when burst mode DDR A2S and S2A converters (and their fixed-length variants) are constructed using these datapath transfer units, the synchronous recipient or sender logic counts two tokens per clock cycle.

When using the variable-length burst designs (with tail bit control), the tail bit, like the synchronous handshake control signals S^(i) and S^(o), remains a single-data-rate signal. Each tail bit value applies to the pair of data tokens transferred on that cycle.

According to yet other embodiments, which will now be described with reference to FIGS. 65-69, A2S and S2A conversion circuits are used to implement a DDR-SDRAM interface. According to a specific embodiment, extensions to the circuits described above make such an implementation possible. These extensions include a master/slave converter system, which allows the conversion of different types of information to be linked, and a nop-counter, which can give increased performance when there are complex constraints on the minimum spacing between data items.

In one such embodiment, the SDRAM interface uses a master/slave design in which multiple slave converters are controlled by commands sent through a “master” A2S converter. Basically, the control process of each of the slave converters is replaced with a shared mechanism that generates “go” signals for each. Based on the command word transferred, the system may also trigger one or more of the slave converters, possibly after some number of clock cycles of delay.

As described here, the master converter is A2S. However, it will be understood that a similar system could be designed with a master S2A converter and still remain within the scope of the invention.

The DDR-SDRAM protocol specifies that data are transferred (read or write) in a continuous burst starting a specified number of clock cycles after the corresponding read or write command. Hence the asynchronous side must ensure that data are available for writing, or empty buffer space available for reading, before issuing the associated read or write command. This requires that the converters for commands and data be linked.

A slave A2S or S2A converter comprises a normal converter (such as any of those described above) with its control process removed. According to various embodiments, such normal converters may comprise, for example, single-word converters (e.g., FIG. 65), burst converters (e.g., FIGS. 66A and 66B), or fixed burst converters (e.g., FIGS. 67A and 67B). Such converters may also be double data rate (DDR) converters, but are not required to be so.

Deleting the control process leaves the slave converter with an input signal “go” and an output completion channel Ac. (For the embodiments described above, the completion channel was called Ac for single-word converters and Bc for burst-mode converters.) These channels will be referred to below as Ac for simplicity.

According to a particular implementation, a slave converter does not itself perform a synchronous handshake. Instead, it simply transfers data on every rising clock edge on which its input signal “go” is asserted. It is the responsibility of the larger system to satisfy the same conditions as are placed on the control unit of a standalone converter, i.e., to wait for a token on Ac, and to perform any necessary synchronous flow control, before asserting the “go” signal. In a slave S2A converter, as described above with reference to a standalone S2A converter, the Ac channel is initialized with tokens to match the amount of datapath buffering.

The general organization of an exemplary master/slave converter system is shown in FIG. 68. Before a command is sent to master A2S converter 6802, it passes through a control block MASTER_COMPLETE 6804 which checks that the necessary slave converters are ready. According to one embodiment, MASTER_COMPLETE 6804 executes the following operation, specified in pseudocode, for every command:

L?command;
for each slave converter S,
  if command requires a transfer on S,
    Ac[S]?;   // receive completion token from S
R!command

Once the command emerges from MASTER_COMPLETE 6804, it is passed through standalone A2S converter 6802 (the “master” converter).

On the synchronous side, a SLAVE_TRIGGER unit 6806 is responsible for raising the “go” signals of the appropriate slave converters at the appropriate times, depending on the command. A simple version of SLAVE_TRIGGER 6806 could observe the output channel C from master A2S converter 6802. On each rising clock edge, if C is valid (C.v and C.e both high) and the command C.d indicates a slave transfer, then the corresponding slave converter is triggered through a delay line. In a particular embodiment, the delay in each delay line is programmable and corresponds to an integer number of clock periods. In general, SLAVE_TRIGGER 6806 may be more complex, including, for example, synchronous handshaking on the slave converters or other forms of synchronous control.
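
One minimal version of such a trigger, with a shift-register delay line, might look like the following sketch; the decode of C.d into needs_slave, the delay-line depth, and all of the names are assumptions made for illustration.

// Per-slave trigger with a programmable delay line (a sketch).
module slave_trigger #(parameter DEPTH = 8) (
  input  wire clk, rst,
  input  wire c_v, c_e,        // master output channel valid/enable
  input  wire needs_slave,     // decoded from C.d: command uses this slave
  input  wire [2:0] delay,     // programmable delay in clock periods
  output wire go               // "go" to the slave converter
);
  reg [DEPTH-1:0] pipe;        // shift-register delay line
  wire fire = c_v && c_e && needs_slave;
  always @(posedge clk)
    if (rst) pipe <= 0;
    else     pipe <= {pipe[DEPTH-2:0], fire};
  // tap the delay line at the programmed depth (delay of 0 = immediate)
  assign go = (delay == 0) ? fire : pipe[delay-1];
endmodule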

According to some embodiments, slave converters 6808 and 6810 have more datapath buffering than their standalone counterparts. That is, in place of the control process of the standalone converter, with its relatively small latency from Ac to “go”, the control latency of the slave converter passes through MASTER_COMPLETE 6804, master A2S 6802, and SLAVE_TRIGGER 6806 (with the associated delays). Therefore, the datapath buffering of the slave converter is increased to match this greater latency. The number of initial tokens on the Ac channel of slave S2A converter 6808 (representing initial empty buffer space) may be increased accordingly.

According to various embodiments, the NOP_COUNTER is a synchronous unit whose responsibility is to ensure that items sent through an A2S converter are separated by at least a minimum number of clock cycles. The number is given with each item, and specifies the minimum number of skipped cycles between that item and the previous one.

The DDR-SDRAM protocol has numerous requirements on the timing of commands which, for particular implementations, can all be expressed as minimums: before a certain command can be issued, at least a minimum number of cycles must have been skipped (issuing null commands, or NOPs) since the previous command.

According to one embodiment, the required number of NOPs may be generated on the asynchronous side and passed through the A2S. According to such an implementation, it would then merely be necessary for the synchronous side to generate additional NOPs when no command was available. The disadvantage of this approach is that it may add unnecessary delay between commands that are already widely separated. The minimum number of NOPs is not known until the following command is known, so passing those explicit NOPs through the A2S before the following command would add extra delay even though more than enough delay may already have passed.

Referring now to FIG. 69, NOP_COUNTER 6902 is a synchronous block attached to the output of A2S 6904. Its input and output each comprise synchronous handshake channels. The input channel carries items (commands) with an associated minimum NOP count, and the output channel sends out those same items spaced by the required number of cycles. One possible CSP specification of this unit is the following:

count := lcount := rcount := 0;
has_l := has_r := false;
L.e := R.v := false;
*[ [˜CLK];
   [ has_r & (count >= rcount) -> R.d := r, R.v := true
   [] else -> R.v := false ],
   (L.e := ˜has_l);
   [CLK];
   [ R.v & R.e -> count := 0, has_r := false
   [] else -> count := count+1 ],
   [ L.v & L.e -> (l,lcount) := L.d, has_l := true
   [] else -> skip ];
   [ has_l & ˜has_r -> has_l := false, has_r := true, (r,rcount) := (l,lcount)
   [] else -> skip ] ]

In this program, the variable “count” holds the number of cycles since the last output on R. The pair (l,lcount) holds the input data and associated minimum NOP count; this is copied to (r,rcount) for output. The Booleans has_l and has_r indicate when each of these pairs holds a valid token. Having two variable pairs allows the unit to input and output on the same clock cycle.

On each falling clock edge, NOP_COUNTER 6902 sets its output signals. When there is a token in r (has_r high), and the number of cycles since the last output is at least rcount (count >= rcount), it sets R.d and R.v to send the value r; otherwise, it does not send. Also, if there is no token in l (has_l low), it raises L.e to enable input.

Data are transferred on the rising clock edge. If there is an output on R (R.v and R.e high), then the token is removed from r, and the count of cycles since the last output is reset to 0; otherwise, the count is incremented. If there is an input on L, the data and nop-count are read into a token in (l,lcount). Finally, if there is now a token in l but not one in r, the token is transferred from l to r.
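
A synchronous Verilog rendering of this behavior is sketched below; the port names and widths are illustrative assumptions, and the blocking assignments inside the clocked block mirror the sequential updates of the CSP.

// Behavioral sketch of NOP_COUNTER (illustrative names and widths).
module nop_counter #(parameter W = 8, CW = 8) (
  input  wire          clk, rst,
  input  wire          l_v,      // input channel valid
  input  wire [W-1:0]  l_d,      // command
  input  wire [CW-1:0] l_count,  // minimum NOP count for this command
  output wire          l_e,      // input enable
  output wire          r_v,      // output valid
  output wire [W-1:0]  r_d,
  input  wire          r_e       // output enable (from the environment)
);
  reg [CW-1:0] count, lcount, rcount;
  reg [W-1:0]  l, r;
  reg          has_l, has_r;

  // "falling edge" phase of the CSP: outputs follow the state
  assign r_v = has_r && (count >= rcount);
  assign r_d = r;
  assign l_e = !has_l;

  // "rising edge" phase; blocking assignments mirror the CSP sequencing
  always @(posedge clk)
    if (rst) begin
      count = 0; has_l = 0; has_r = 0;
    end else begin
      // 1) output side: token leaves r, or another skipped cycle counts
      if (has_r && (count >= rcount) && r_e) begin
        count = 0; has_r = 0;
      end else
        count = count + 1;
      // 2) input side: latch a new (data, min-NOP-count) pair
      if (l_v && !has_l) begin
        l = l_d; lcount = l_count; has_l = 1;
      end
      // 3) promote the waiting token when the output slot is free
      if (has_l && !has_r) begin
        r = l; rcount = lcount; has_l = 0; has_r = 1;
      end
    end
endmodule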

The SDRAM interface uses a NOP_COUNTER in conjunction with the master/slave design described above. The NOP_COUNTER is attached to the master converter and may be considered part of the master converter for purposes of the rest of the design. In a particular implementation, the SLAVE_TRIGGER unit observes the output channel of the NOP_COUNTER as though it were the output of the master converter. This keeps the timing of the slave converters consistent with the command stream that emerges from the NOP_COUNTER and is visible to the other synchronous circuitry.

According to further embodiments of the invention, system interconnect solutions may be implemented using various combinations of various implementations of the circuits described above. According to one such embodiment, a system interconnect is implemented to facilitate integration of a system-on-a-chip (SOC) having a plurality of system modules, each having its own clock domain. As will be seen, various embodiments of the invention allow the high-speed interconnection of disparate clock domains while simultaneously eliminating issues relating to synchronization of the various data rates. Even for systems which do not require high speed interconnection, the elimination of the clocking problem makes the solutions provided by the present invention very attractive. In addition, because of the asynchronous nature of the interconnect, the various synchronous devices in such systems may easily be replaced by devices with higher data rates without impacting the interconnect design.

An example of such a system 7000 is shown in FIG. 70. The interconnect portion of system 7000 includes four components that can be assembled as needed to serve a variety of SOC interconnect applications. Asynchronous crossbar 7002 is a compact and efficient multi-port non-blocking switch fabric that is used to interconnect multiple modules on chip with different data rates, transparently and with very modest overhead. According to one exemplary implementation, crossbar 7002 has a total capacity in excess of 1 Tbps, while maintaining a modest footprint and power profile. A plurality of clock domain converters 7004 connect any of synchronous modules 7006 to crossbar 7002 at any data rate from DC to the maximum frequency of a crossbar port. According to a specific embodiment, each clock domain converter (CDC) includes one synchronous-to-asynchronous converter (S2A) and one asynchronous-to-synchronous converter (A2S), each converter corresponding to a pre-defined data path width including, for example, a 36-bit data path, a 4-bit destination field, and a tail bit. It will be understood, however, that the various embodiments of the invention are not limited to any specific data path.

Once converted, the data are transported asynchronously through crossbar 7002 and delivered to the clock domain converter associated with another synchronous module, where the data are converted and delivered to the target module. Converters 7004 and crossbar 7002 provide seamless conversion between two clock domains in addition to transport across a large-scale SOC, more efficiently than simple domain conversion in a typical synchronous SOC solution, since in the case of a synchronous bus there may be two clock domain crossings. According to various exemplary implementations, the various asynchronous links in system 7000 may be implemented using any 1ofN code. According to various specific embodiments, these links are implemented as 1of1, 1of2 or 1of4 rail domino logic.

As will be understood, the channels that link on-chip modules to crossbar 7002 will vary in length as appropriate for a given implementation. Physically longer channels can have an adverse impact on throughput because of the added delay attributable to the asynchronous handshake. Therefore, for channels that span distances beyond a certain length, e.g., more than 1-2 mm in a 0.13 um implementation, a pipelined repeater 7008 may be added to the channel to repeat the signal and maintain a high data rate. Repeaters 7008 are used to regenerate the asynchronous signals on long wires and are typically installed at the mid-point of a long bus between the clock domain converter and the crossbar. According to some embodiments, more than one repeater may be used without degrading total throughput. Because repeaters 7008 are asynchronous in nature, they have very low forward latency and create the effect of fine-grain pipelining. An exemplary circuit for such a repeater for use with specific embodiments of the invention is shown in FIG. 70A. It will be understood that a wide variety of circuits may be employed for this function.

According to a specific embodiment, a built-in self-test (BIST) module 7010 is provided for facilitating testing of the interconnect portion of system 7000. In the embodiment shown, BIST module 7010 is disposed on the data path 7012 between crossbar 7002 and a particular CDC 7004, and provides pattern generation and checking for a selected datapath through the crossbar including one or more other CDCs 7004. According to a specific embodiment, BIST module 7010 has a simple command/response interface allowing an external tester or scan chain to do complete fault and performance coverage of the system. The results can be used for speed binning, as well as for pass-fail analysis. It should be noted that BIST module 7010 does not necessarily need to be positioned as shown in FIG. 70. That is, if any of the ports of crossbar 7002 are unused, BIST circuitry can occupy one of such ports by itself.

According to a particular implementation, a simple link layer protocol is employed to communicate via the interconnect of system 7000. According to this protocol, the sending module specifies a target module and sends the target a variable length block of data terminated by a tail bit. Any contention between multiple modules trying to send to the same destination is resolved by asynchronous crossbar 7002 using, for example, the arbitration techniques described above. Various ordering relationships, such as producer-consumer, are automatically preserved. According to specific embodiments, load completion ordering may also be preserved if desired. Any of a wide variety of higher level protocols, such as load/store or send/receive, may be mapped on top of the link layer protocol as required. It will be understood that the interconnect solution of the present invention may be scaled to fit a variety of applications, both in terms of the number of ports and in the number of bits per port (i.e., port throughput).

According to a specific embodiment, crossbar 7002 provides connectivity from any one of its ports to any other of its ports at data rates up to 36 Gbps per connection and direction using a 36-bit channel (32-bit data and 4-bit parity) operating at 1 GHz in a 0.13 um CMOS implementation. According to other embodiments, higher speeds are possible using higher density technology. Data rates up to 72 Gbps are supported for a 16-port 72-bit implementation. The crossbar supports simultaneous (i.e., non-blocking) data transmission between multiple modules with an efficiency that approaches the maximum per-port throughput of all blocks receiving data. According to a more specific embodiment, the crossbar employs a scattering algorithm to maintain fair access among the modules to the interconnect and to each other.

FIG. 71 shows the block diagram of a 16-port crossbar 7100, each port comprising a plurality of bits, e.g., 36 or 72 bits. The crossbar includes three elements: an input control unit 7102, an output control and arbitration unit 7104, and the crossbar circuit 7106 itself. Input control unit 7102 controls access to crossbar 7106. Output control and arbitration unit 7104 controls access to the output ports. Crossbar 7106 performs the actual connections.

Each transaction going through crossbar 7106 comprises a data burst of variable length which is delimited by a tail bit flagging the last element of the burst. Each transaction includes an arbitration phase and a data transfer phase. During the arbitration phase, the source CDC presents the desired destination on the To bus and the data on the TxData bus. The content of the To bus is decoded and a separate individual line is routed via one of 256 Request lines to output control and arbitration unit 7104. An arbiter for each output port arbitrates among the different requesters and grants one request while maintaining the other requests in a pending state.

The data transfer phase starts immediately after the arbitration phase is completed, and continues until the last data of the burst have been transferred, the last data being marked by the tail bit (i.e., TxTail). Upon detection of the tail bit, the arbiter at the destination port will return to the arbitration phase.

Asynchronous arbiters are somewhat different from their synchronous counterparts. That is, a typical synchronous arbiter is a finite state machine which samples request lines on a clock edge, then makes a deterministic decision about the order in which to service requests. An asynchronous arbiter must make similar decisions, but must also deal with completely unaligned (“asynchronous”) transitions on the request lines. This makes the asynchronous arbiter able to handle events at arbitrary times without reference to any clocks. In a particular implementation of the interconnect system of the present invention, asynchronous arbiters are used to decide the ordering of bursts from different sources contending for the same target port. If the target port is not congested, the bursts will arrive in first-come first-served order. If the target port is congested, the senders' requests will be serviced fairly.

FIG. 72 shows an asynchronous arbiter 7200 which may be employed with specific embodiments of the present invention. Arbiter 7200 is a two-input strictly fair arbiter. It is based upon the Seitz arbiter, which comprises cross-coupled NAND gates followed by a metastability filter. Arbiter 7200 takes two rising events in[0] and in[1] and produces mutually exclusive rising output events on out[0] or out[1] depending on which input arrived first. After the winner is selected, additional asynchronous circuitry is used to lower the winner's input before making another request. This gives the other input time to win if it was waiting. If both inputs arrive at the same time, the cross-coupled NAND gates are metastable. Eventually, one or the other will win. The pass gate filter for the outputs guarantees that no output event will occur until the internal nodes have separated by at least a PMOS threshold, which indicates that the arbitration is over. In a delay-insensitive system, there is no danger of sampling the output before the arbitration is complete. That is, the metastability filter adds an unpredictable delay, but guarantees that the final decision will be stable and safe. If both inputs make requests as fast as they can, this arbiter will strictly alternate between the two.
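
A structural sketch of such an arbiter is shown below. The cross-coupled NAND pair is as described; the metastability filter, which in the actual circuit of FIG. 72 is an analog pass-gate/threshold structure, is only approximated here by logic that raises an output once the internal nodes have separated.

// Structural sketch of a Seitz-style two-input arbiter. The assign
// statements approximate the metastability filter logically; a real
// implementation requires the analog pass-gate filter described above.
module seitz_arbiter (
  input  wire in0, in1,
  output wire out0, out1
);
  wire g0, g1;
  nand (g0, in0, g1);   // cross-coupled NAND pair
  nand (g1, in1, g0);
  // an output rises only when its node has fallen and the other is high
  assign out0 = ~g0 & g1;
  assign out1 = ~g1 & g0;
endmodule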

According to a specific embodiment, multi-way arbitration may be accomplished by cascading these 2-way fair arbitration cells into binary trees. For example, a 16-way arbitration may be achieved using 15 2-way arbiters in a binary tree to select the winner for each output port. If the output is relatively uncongested, the requests ripple straight up the binary tree to the root very quickly, and get serviced in first-come-first-served order. On the other hand, if the port backs up and becomes congested, then each two-way arbiter in the tree will alternate between its two inputs. Thus, all requests are serviced fairly without starvation. For 2-way contention, such a tree arbiter is still strictly fair. It should be noted, however, that if three ports are contending constantly, it is possible for one to get 50% and the other two 25% each, depending on where they are in the tree. The greatest possible mismatch for a 16-way tree arbiter would be one port with 50% (a single contender whose half of the tree is otherwise idle) and eight others with 6.25% each (eight contenders sharing the other half of the tree). This possibility is only significant for sustained long-term contention.

The arbitration circuit described in the previous paragraph and shown in FIG. 72 has the advantages of low latency and small area. Other, more complex arbitration circuits, such as round-robin or weighted round-robin arbiters, can also be implemented.

If the priority of a particular synchronous module in the interconnect of the present invention is unclear, its behavior can be modified according to various embodiments to enforce a particular priority in the system. According to some such embodiments, the synchronous module is enabled to rate-throttle its own requests. That is, requests following too quickly after a previous request are delayed. According to one embodiment, this is achieved simply by running low priority modules at lower clock rates.

Alternatively, it may be achieved using a leaky bucket shaping algorithm. For long term sustained congestion, this is an effective way of regulating the priority of traffic through a crossbar. Indeed, in a switch fabric system, it is often the responsibility of the traffic managers to shape different types of traffic appropriately such that the crossbar itself need only arbitrate in a simple way. A leaky bucket algorithm also has the advantage that the first request after a long wait will be made immediately without waiting. Only streaming, sustained data transfer is rate throttled.
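
A minimal credit-based sketch of such a shaper is given below; the RATE and BURST parameters, and the choice to spend one credit per forwarded request cycle, are assumptions of the sketch (a real design might instead count completed transactions, as the throttler listing below does).

// Minimal leaky-bucket request shaper (a sketch). Credits accrue one
// per RATE cycles up to BURST, so an isolated request after a long
// idle period is forwarded immediately; only sustained streams are
// throttled.
module leaky_bucket #(parameter RATE = 8, BURST = 4) (
  input  wire clk, rst,
  input  wire req_in,   // request from the synchronous module
  output wire req_out   // shaped request toward the CDC
);
  reg [7:0] credits, tick;
  wire gain  = (tick == RATE-1) && (credits != BURST);
  wire spend = req_in && (credits != 0);
  assign req_out = spend;
  always @(posedge clk)
    if (rst) begin
      credits <= BURST;   // full bucket: first requests pass at once
      tick    <= 0;
    end else begin
      tick <= (tick == RATE-1) ? 0 : tick + 1;
      if (gain && !spend)      credits <= credits + 1;
      else if (spend && !gain) credits <= credits - 1;
    end
endmodule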

On the other hand, purely rate throttling a sender has the drawback of not necessarily utilizing all the available bandwidth of the system. For strict real time flows or flows with fixed maximum data rates (such as slower I/O streams), this may not be a problem. However, if it is an issue, a module can be configured to actually sense the current congestion of the system by counting how often the ack line goes low. This indicates that a request was back-pressured due to downstream congestion. If this is detected, the module can enable rate throttling for a while. However, if the grant remains high, the module knows that the system is uncongested and it can continue making requests at its maximum possible rate.

FIG. 73 shows a simple throttling circuit 7302 installed between a clock domain converter 7304 and the associated synchronous module 7306 which is the originator of the transactions. Throttling circuit 7302 is controlled by two registers, a counter (the size of which is implementation dependent), and a throttling status bit. The request from the device is either passed transparently if throttling is not active, or delayed if throttling is active. The Verilog listing for a particular implementation of rate throttling circuit 7302 is provided below.

    /*****************************************************************************
     * Throttler.V
     *
     * The following circuit implements a throttler.
     *
     * Input:  ModReq:    The request from the original module
     *         Ack:       The acknowledge from the SYNC/ASYNC circuit
     *         Tail:      The indication that this burst is complete
     *         TrigLevel: The level at which the throttling is activated
     *         Pause:     The duration of the throttling
     *
     * Output: Count:     The current 'delay' counter
     *         FwdReq:    The request forwarded to the SYNC/ASYNC circuit
     *****************************************************************************/
    module Throttler (ModReq, FwdReq, Ack, Tail, Clock, Count, TrigLevel, Pause);
      input ModReq, Clock, Ack, Tail;
      input [7:0] TrigLevel, Pause;
      output FwdReq;
      output [7:0] Count;
      reg [7:0] Count;
      reg Throttling;

      /* Power-on state for simulation; a hardware implementation would
         use a reset. */
      initial begin
        Count = 0;
        Throttling = 0;
      end

      /* The request is immediately forwarded if throttling is not active. */
      assign FwdReq = ModReq && !Throttling;

      always @(posedge Clock)
        begin
        if ( Throttling )
          /* If throttling is active, then decrement the current counter
             until it reaches 0 and then deactivate the throttling */
          begin
          if ( Count == 0 )
            begin
            Throttling = 0;
            Count = 0;
            end
          else
            begin
            Count = Count - 1;
            Throttling = 1;
            end
          end
        else if ( ModReq && !Ack )
          /* If not at the end of one transaction and there is a request
             but no ACK, indicating a delay in forwarding the request,
             then increment the counter up to a maximum */
          begin
          Throttling = 0;
          if ( Count < 255 )
            Count = Count + 1;
          end
        else if ( ModReq && Ack && Tail )
          /* If at the end of the transaction, then check whether the
             counter exceeds the threshold; if yes, activate throttling;
             if not, decrement the counter by one to indicate that a
             delay of one clock is acceptable */
          begin
          if ( Count >= TrigLevel )
            begin
            Throttling = 1;
            Count = Pause;
            end
          else if ( Count > 0 )
            begin
            Count = Count - 1;
            Throttling = 0;
            end
          end
        end
    endmodule

According to another embodiment, this circuit may be enhanced to report throttling per destination, allowing the originator module to forward transactions to free destinations and avoid the destinations that are busy.

As discussed above, the clock domain converter (CDC) allows interconnection of a synchronous bus to the asynchronous crossbar. According to various embodiments, the CDC is a transaction-oriented interface where each transaction includes a data burst of variable length terminated by a tail bit. The transaction may be as short as one word or as long as desired. For some embodiments, however, the transaction length is kept relatively short to minimize the latency during contention at the destination.

A block diagram of an exemplary CDC 7400 for use with the interconnect of the present invention is provided in FIG. 74. A synchronous-to-asynchronous (S2A) interface 7402 converts signals from the clock domain of synchronous module 7404 to the asynchronous domain. An asynchronous-to-synchronous (A2S) interface 7406 converts signals from the asynchronous domain to module 7404's clock domain. These interfaces may be implemented according to the various techniques described above.

CDC 7400 indicates that it is ready to receive a data word from module 7404, i.e., <to[3:0], td[N:0], ttail>, by asserting an enable signal (not shown). Module 7404 indicates the presence of a valid data word by asserting a valid signal (not shown) and by driving the data word prior to the rising edge of the clock. A data transfer thus occurs when both the enable and valid signals are asserted. The first word of the transaction should include a valid destination port in the to[3:0] bus. The content of this bus is ignored for all subsequent words of the transaction until a tail bit is detected.
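
The first-word rule can be sketched in Verilog as follows. This is offered only as an illustration; the enable/valid signal names and the register structure are assumptions introduced here, with only to[3:0] and the tail bit taken from the description above.

    // Sketch of the enable/valid transfer rule and first-word routing.
    module cdc_ingress_sketch (
      input clk,
      input valid,                 // module presents a valid word
      input enable,                // CDC is ready to accept a word
      input [3:0] to,              // destination port bus
      input ttail,                 // tail bit marking the last word
      output reg [3:0] dest_port,  // destination latched for the burst
      output reg first_word
    );
      initial first_word = 1'b1;
      wire transfer = enable & valid;    // a word moves on this clock edge
      always @(posedge clk)
        if (transfer) begin
          if (first_word)
            dest_port <= to;             // to[3:0] honored on first word only
          first_word <= ttail;           // tail re-arms the first-word flag
        end
    endmodule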

According to a particular implementation, the CDC includes a loopback function that is activated to facilitate the BIST function. When the loopback function is enabled during testing, the data coming from the crossbar enters A2S interface 7406, and is converted to a synchronous signal which is forwarded to module 7404 and S2A interface 7402 through a multiplexer 7408. Thus, during loopback, multiplexer 7408 routes the data from A2S interface 7406 back to the crossbar instead of routing the data from module 7404. Buffer 7410 swaps the from[3:0] and rd[3:0] lines on the first word of the transaction to force the transaction to go to another CDC via the crossbar. Thus, from a single location in the system, all paths through the interconnect may be tested. In addition, each path may be tested at full rate by successive application of test vectors, thus allowing verification of backpressure mechanisms.
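
A behavioral sketch of the loopback steering is given below; the module name, port names, and parameterized width are assumptions for illustration. When loopback is enabled, the word returned from the A2S side is steered back toward the crossbar in place of the module's outbound data.

    // Loopback multiplexer sketch (illustrative names).
    module loopback_mux_sketch #(parameter W = 36) (
      input  loopback_en,
      input  [W-1:0] from_module,   // normal outbound data
      input  [W-1:0] from_a2s,      // data returned from the crossbar
      output [W-1:0] to_s2a         // data presented to the S2A interface
    );
      assign to_s2a = loopback_en ? from_a2s : from_module;
    endmodule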

Operation of a specific embodiment of a built-in self test (BIST) module for use with various embodiments of the invention will now be described with reference to FIG. 75. As described above, BIST module 7502 is designed to provide access by external test equipment to the various data paths in the interconnect system 7500. According to this embodiment, BIST module 7502 has two modes of operation: active and inactive. In the inactive state, BIST module 7502 is transparent and relays transactions from an attached synchronous device 7504 to crossbar 7506 (via CDC 7508), or vice versa, without any interpretation of the transactions being passed.

In the active state, BIST module 7502 reroutes attached CDC interface 7508 to its own internal data path, executes a test vector requested on a vector_in input, and presents the result of the test on a vector_out output once the test requested has been executed. According to a specific embodiment, the structures of the vector_in and vector_out words are as shown in FIGS. 76A and 76B, respectively. The format of the transactions generated by BIST module 7502 according to this embodiment is shown in FIG. 76C.

Bits 0 to 3 of the nextnode field in FIG. 76C define the next destination that the packet must take, while bits 4 to 7 are reserved for future usage. The pattern is a replication of the 8-bit pattern defined in the input vector. FIG. 75 shows the path followed by an exemplary transaction originated from BIST module 7502 and going through two nodes (i.e., CDC 7510 and CDC 7512).

Upon reception of an input vector from scan register 7514, BIST module 7502 starts transmitting a first transaction to the first node, i.e., CDC 7510, which forwards the transaction to CDC 7512, which then forwards it back to BIST module 7502, which checks the integrity of the data. As described above with reference to FIG. 74, the first and second CDCs determine the next hop by reading bits 0-3 of the first word of the transaction received, each of the CDCs also replacing bits 0-3 with the content of the from field received with the transaction while leaving the subsequent words of the transaction unchanged. The swapping of the bits allows the transaction to fully test the data path.
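
The per-hop header rewrite can be sketched as follows; the module, parameter, and signal names are assumptions introduced for illustration and do not appear in the specification. On the first word only, the next-hop bits [3:0] are replaced by the port the word arrived from, bouncing the packet onward; later words pass through untouched.

    // Per-hop header rewrite sketch (illustrative names).
    module bist_hop_rewrite_sketch #(parameter W = 36) (
      input  [W-1:0] word_in,
      input  [3:0]   from_port,   // port the word arrived from
      input  first_word,          // asserted on the first word of a burst
      output [W-1:0] word_out
    );
      assign word_out = first_word
                      ? { word_in[W-1:4], from_port }  // swap in the source
                      : word_in;                       // later words untouched
    endmodule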

Input and output scan registers (block 7514) are inserted in the data path between module 7504 and CDC 7508. According to various embodiments, the input scan register is 43 bits for a 36-bit wide version, and 79 bits for a 72-bit wide version, and is used to supply the input vector as well as to control the status of the CDCs and the BIST mode. In addition to the input vector, one bit of the input scan register maps to an enable bit for the BIST module, while another maps to an enable bit for the loopback function described above.

The output scan register is 41 bits wide for a 36-bit wide version, and 77 bits wide for a 72-bit version, and is used to recover the output vector. In addition to the output vector, one bit of the output scan register indicates whether the BIST module is ready to receive a new vector (implying the previous vector has been processed).

FIG. 77 shows a multi-processor system-on-a-chip 7700 employing a particular implementation of the interconnect of the present invention. This design includes a mixture of synchronous devices that act either as master devices, slave devices, or both. A master device is a device capable of generating requests to other devices, while a slave device is a device receiving and executing requests from master devices and, in some cases, transmitting a response to the originator.

Each of CPUs 7702 and 7704 is a master device which may generate requests to read (load) from or write (store) to any of the other devices (except the other CPU). Such requests could come, for example, from the instruction or data cache units, or the register file of the CPU. By contrast, embedded RAM 7706 is a slave device which receives load or store requests from any master devices, but never receives requests or responses from any other slave device. PCI Bus Controller 7708 is both a master and a slave device which generates load or store requests to slave devices, or receives load or store requests from master devices. DDR Controllers 7710 and 7712 are slave devices which receive load or store requests from any master devices, but not from any other slave device. Ethernet controller 7714 is both a master and a slave device which generates load or store requests to slave devices or may receive such requests from a master device.

According to a specific embodiment, the interconnect solution of FIG. 77 may be implemented without changing the current interfaces of the various synchronous modules by using an adaptation layer (AL) (not shown) to convert the higher level transactions from the synchronous modules into simple bursts compatible with the crossbar protocol. A different burst type would be used to represent each of the one-way, unicast transactions such as memory load, memory store, cache snoop and cache invalidate. The overall transaction is accomplished in the AL by unrolling any multicast operations and collecting the expected results. According to a more specific embodiment, three types of AL are employed. A master AL is used by a master device, e.g., a CPU, to initiate all types of transactions. A cache AL is used by a CPU to respond to snoop and invalidate transactions. A target AL is used by a slave device, e.g., memory, to respond to load and store transactions. According to a particular embodiment, each AL associated with a particular device employs its own interconnect port.
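
By way of illustration, the completion-collection half of that unrolling might be sketched as follows, where a multicast snoop has already been issued as one unicast burst per peer cache and the master AL simply waits until every completion has arrived; the structure and all names are assumptions, not taken from the specification.

    // Completion-collection sketch for an unrolled multicast (illustrative).
    module snoop_collect_sketch #(parameter NPEER = 3) (
      input clk,
      input start,                       // snoop just unrolled to all peers
      input [NPEER-1:0] completion,      // per-peer completion strobes
      output busy                        // high until every peer has replied
    );
      reg [NPEER-1:0] pending;
      initial pending = 0;
      always @(posedge clk)
        if (start)
          pending <= {NPEER{1'b1}};          // one outstanding burst per peer
        else
          pending <= pending & ~completion;  // clear bits as replies arrive
      assign busy = (pending != 0);
    endmodule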

According to such an embodiment, CPUs 7702 and 7704 employ master and cache ALs (not shown). Other devices, e.g., PCI controller 7708 or Ethernet controller 7714, employ master and target ALs. Memory, e.g., memory controllers 7710 and 7712, employs target ALs. The target AL simply passes through the bursts and may be omitted.

The master AL sends four types of requests, i.e., snoop, invalidate, load, and store. The snoop requests are transmitted to the cache AL of all other CPUs. The completions are collected and the data (if any) are returned. The invalidate requests are sent to the cache AL of all other CPUs. The completions are collected and summarized. The load requests are sent to the specified target AL and a completion is returned. The store request is sent to the specified target AL and nothing is returned.

The cache AL receives snoop requests, looks up the line in the associated cache, and sends data and hit/miss/dirty information back in a completion message. The cache AL also receives invalidate requests, looks up the line in the cache, and sends back the appropriate completion message.

Separation of these functions into separate ALs on different interconnect ports allows common parts of the system to be shared by the three different types of synchronous devices in the system. In addition, using separate ports for initiating transactions and responding to transactions increases overall system performance, and greatly simplifies system deadlock considerations. According to a particular embodiment, this approach (along with the non-blocking crossbar) enables devices to keep making requests without concern for the amount of buffering at the target devices. That is, the system can rely on local flow control. The master AL must guarantee that any completions are drained when they arrive, but the target and cache ALs need no buffering at all. Notwithstanding the foregoing, it should be noted that the ALs of each device may share an interconnect port, but care must be taken to avoid mutual deadlock scenarios.

FIG. 78 is another example of a system-on-a-chip employing an interconnect designed according to the present invention. SONET interconnect switch 7800 includes a plurality of SONET interfaces 7802 interconnected according to the present invention, thus facilitating communication among external optical interfaces 7804.

While the invention has been particularly shown and described with reference to specific embodiments thereof, it will be understood by those skilled in the art that changes in the form and details of the disclosed embodiments may be made without departing from the spirit or scope of the invention. For example, various interconnect solutions have been described herein with reference to a particular asynchronous design style. However, it will be understood that any of a variety of asynchronous domain types are within the scope of the invention. Moreover, the specific details of the circuits described herein are merely exemplary and should not be considered as limiting the invention. Rather, any circuits implementing the basic functionality of the circuits described herein are also within the scope of the invention. In addition, while some specific examples have been provided of systems which may employ the various interconnect solutions described herein, it will be understood that the scope of the present invention encompasses any system that employs any of the various interconnection solutions embodied by the invention.

Finally, although various advantages, aspects, and objects of the present invention have been discussed herein with reference to various embodiments, it will be understood that the scope of the invention should not be limited by reference to such advantages, aspects, and objects. Rather, the scope of the invention should be determined with reference to the appended claims.

1. An integrated circuit, comprising: a plurality of synchronous modules, each synchronous module having an associated clock domain characterized by a data rate, the data rates comprising a plurality of different data rates; a plurality of clock domain converters, each clock domain converter being coupled to a corresponding one of the synchronous modules, and being operable to convert data between the clock domain of the corresponding synchronous module and an asynchronous domain characterized by transmission of data according to an asynchronous handshake protocol; and an asynchronous crossbar coupled to the plurality of clock domain converters, and operable in the asynchronous domain to implement a first-in-first-out (FIFO) channel between any two of the clock domain converters, thereby facilitating communication between any two of the synchronous modules.
2. The integrated circuit of claim 1 wherein the asynchronous crossbar is operable to route the data from any of a first number of input channels to any of a second number of output channels according to routing control information, each combination of an input channel and an output channel comprising one of a plurality of links, the crossbar being operable to route the data in a deterministic manner on each of the links thereby preserving a partial ordering represented by the routing control information, wherein events on different links are uncorrelated.
3. The integrated circuit of claim 2 wherein the crossbar is operable to transfer the data on at least one of the links based on at least one timing assumption.
4. The integrated circuit of claim 3 wherein the at least one timing assumption comprises any of a pulse timing assumption, an interference timing assumption, and an implied-data-neutrality timing assumption.
5. The integrated circuit of claim 1 wherein the asynchronous handshake protocol between a first sender and a first receiver comprises: the first sender sets a data signal valid when an enable signal from the first receiver goes high; the first receiver lowers the enable signal upon receiving the valid data signal; the first sender sets the data signal neutral upon receiving the low enable signal; and the first receiver raises the enable signal upon receiving the neutral data signal.
6. The integrated circuit of claim 1 wherein the asynchronous handshake protocol is delay-insensitive.
7. The integrated circuit of claim 1 further comprising at least one repeater, each repeater being coupled between a selected one of the clock domain converters and the asynchronous crossbar.
8. The integrated circuit of claim 7 wherein each repeater comprises an asynchronous half-buffer circuit.
9. The integrated circuit of claim 1 wherein the asynchronous crossbar is operable to arbitrate among multiple requests corresponding to a same destination synchronous module.
10. The integrated circuit of claim 9 wherein the asynchronous crossbar comprises arbitration circuitry to effect arbitration, the arbitration circuitry comprising at least one Seitz arbiter.
11. The integrated circuit of claim 1 further comprising a rate throttling circuit associated with a specific one of the synchronous modules which is operable to control transmission of the data to the corresponding clock domain converter.
12. The integrated circuit of claim 11 wherein the rate throttling circuit is operable to control transmission of the data by delaying transmission of the data in accordance with a priority associated with the specific synchronous module.
13. The integrated circuit of claim 11 wherein the rate throttling circuit is operable to control transmission of the data in response to congestion between the corresponding clock domain converter and the asynchronous crossbar.
14. The integrated circuit of claim 13 wherein the rate throttling circuit is operable to determine the congestion with reference to the asynchronous handshake protocol between the corresponding clock domain converter and the asynchronous crossbar.
15. The integrated circuit of claim 1 further comprising a built-in-self-test (BIST) module between one of the clock domain converters and the asynchronous crossbar, the BIST module being operable to transmit test vectors to and receive result vectors from each of the synchronous modules via the asynchronous crossbar.
16. The integrated circuit of claim 15 wherein the BIST module comprises a scan register for receiving the test vectors from external test equipment.
17. The integrated circuit of claim 15 wherein the BIST module is operable to transmit a first test vector to a first one of the synchronous modules via the asynchronous crossbar, the first synchronous module being operable to transmit a first result vector corresponding to the first test vector to a second one of the synchronous modules via the asynchronous crossbar, the second synchronous module being operable to transmit a second result vector to the BIST module via the asynchronous crossbar, the BIST module being further operable to verify the second result vector.
18. The integrated circuit of claim 1 wherein the integrated circuit comprises a multi-processor system, the plurality of synchronous modules comprising at least two central processing units with associated cache memory, at least one memory controller, at least one internal peripheral device, and at least one I/O interface.
19. The integrated circuit of claim 18 further comprising at least one repeater, each repeater being coupled between a selected one of the clock domain converters and the asynchronous crossbar.
20. The integrated circuit of claim 1 wherein the integrated circuit comprises a synchronous optical network (SONET) interconnect switch, the plurality of synchronous modules comprising a plurality of SONET interfaces.
21. The integrated circuit of claim 20 further comprising at least one repeater, each repeater being coupled between a selected one of the clock domain converters and the asynchronous crossbar.
22. At least one computer-readable medium having data structures stored therein representative of the integrated circuit of claim 1.
23. The at least one computer-readable medium of claim 22 wherein the data structures comprise a simulatable representation of the integrated circuit.
24. The at least one computer-readable medium of claim 23 wherein the simulatable representation comprises a netlist.
25. The at least one computer-readable medium of claim 22 wherein the data structures comprise a code description of the integrated circuit.
26. The at least one computer-readable medium of claim 25 wherein the code description corresponds to a hardware description language.
27. A set of semiconductor processing masks representing the integrated circuit of claim 1.